The success of a data warehouse implementation initiative is greatly affected by the choice of data modeling technique and how it is applied. Data modeling begins the process of structuring the data elements that make up a data warehouse.
In this process, requirements are analyzed and defined to support the objectives of the data warehouse. The results of this analysis are contained within conceptual, logical, and physical data models. Before we dig into the different types of data models, let’s review some key aspects of data modeling for a data warehouse.
Data modeling defines entities along with their inheritance properties, making relationships and dependencies explicit. For example, one result of inheritance modeling in a manufacturing data warehouse implementation would be a definition stating that a product can belong to multiple categories.
Alignment through Logical Decomposition
Entities, relationships, and dependencies, as well as metrics, are modeled across the subject areas of a system. Commonalities in modeling are identified and applied consistently across systems. The modeling of each component of a data warehouse is aligned with a business requirement relevant to a particular subject. Thus, all models in a data warehouse roll up into an overall data warehouse strategy, parsed through initiatives and relevant projects.
Identifying Complexity, Priorities, and Needs for Simplification
Data modeling efforts become more complex as you define entities and the relationships between them. Incomplete data models lead to inadequate business rules, resulting in a sub-optimal solution. However, it’s also important to prioritize data modeling efforts in accordance with project milestones and deliverables.
Decisions about the importance of defining all possible attributes of a single entity versus defining just desired outputs and metrics of a data warehouse system will affect the overall result of data warehouse initiatives.
Eliminating Non-Critical, Isolated, and Unrelated Data
An overarching objective for a data warehouse is often to include all data that exists in an enterprise. However, it’s important to keep the focus on data as it relates to the overall scope of requirements and priorities. Failure to do so can result in extending timelines for deliverables, analysis paralysis, and eventual failure of a data warehouse.
Data Modeling Variables and Inputs
A sub-optimal data model affects the ROI of a data warehousing effort, including higher costs for systems, infrastructure, and business consumer discovery and access as well as additional resources for maintenance and modification. Business rules and requirements are inputs into data models, and this dependency translates into data model changes when business rules and requirements change.
Data Modeling Overview
A data warehouse data model includes definitions of entity types, which are typically grouped into dimensions, reference tables, hierarchies, fact tables, and bridge tables. This section will cover each of these elements. In addition, data modeling includes the classifications of attributes, measures, relationships, and integrity rules.
The process of data modeling begins with the translation of business requirements into conceptual models, which include data definitions.
Following the conceptual model is a logical model that moves closer to the actual implementation. The logical data model identifies both data objects and the relationships between them. One conceptual model often yields multiple logical models, which are referred to as subject or context models.
The physical model rounds out the data modeling process and includes additional physical information, such as physical data types, referential integrity, constraints, and partitions.
Data warehouse teams are faced with several options when implementing data models:
The team can work in a serial order, where the data modeling team completes the entire model before moving to the next phase (i.e., conceptual model first, then logical, and finally physical).
Or the data modeling team can work by subject area, iterating through the conceptual, logical, and physical models for one subject area before moving on to the next one.
Alternatively, the team can perform a parallel effort, in which different areas within the data model are in different development stages (i.e., conceptual, logical, or physical).
The parallel effort typically involves enterprise data modeling tools and requires the team to traverse between subject areas as well as to move between the conceptual, logical, and physical data model within a subject area until the overall model is complete. However, the reality of data warehouses is that data models change and subject areas are continuously added. This reality frequently translates into a parallel or continuous data modeling effort.
For example, in a recent data warehouse project, an organization realized it needed to add support for analyzing data warehouse measures with respect to multiple time zones. The data warehouse had been in production for over a year when the request came in from business stakeholders, but this added requirement was part of a never-ending need to support changes in how the business operated.
This is a very common scenario, and it should come as no surprise to data warehouse developers that they will have to be agile and creative in combining conceptual, logical, and physical data modeling efforts to support new and changing business needs.
The following sections provide more detail on the purpose of the different data models and the activities that take place during each stage.
Major characteristics of a conceptual data model relate to the following points:
Reduce the Enterprise to Its Highest-Level Nouns
Creating one common definition for common entities reduces ambiguity when comparing data and results across different business and organizational units. Examples of these efforts are consolidated definitions for Customer, Product, and Inventory entities.
These nouns may be referred to as domains, master objects, or another abstract term. For example, the Product domain in physical terms will include numerous tables, but in the conceptual phase, it is identified as a single abstract term.
Limit the Conceptual Model to 1 Page, or Fewer than 8 Objects
The purpose of the conceptual model is to provide a high-level understanding of the major entities and business processes in scope for the data warehouse. The general rule is that if the conceptual model can’t fit on a single page, it’s too granular. Figure 2-38 is an example of a high-level conceptual model.
Figure 2-38: Conceptual data model
Subject or Context Model
The context or subject data model defines intersections of conceptual data elements with subject areas in a data warehouse. Here are the main characteristics of a subject data model:
The subject model presents the highest-level functions of the enterprise (Sales, Marketing, Finance, IT).
Some high-level concepts are also subjects (Finance).
The grid of concepts by subject should show that most subjects need most concepts.
The subject model should anticipate the design of data marts or reporting groups.
Figure 2-39 shows an example of a subject model represented via a reverse-engineered bus matrix of dimension usage in SSAS.
Figure 2-39: Subject data model
The logical data model deepens the analysis of data elements in scope of a data warehouse effort and includes more detail about the data.
Included in the logical model are:
Entities, attributes, and domains – The logical model includes more granular information than the conceptual model, defining logical attribute names and domains in which entities belong.
Normalization/denormalization and relationships – Primary and foreign key relationships are identified as well as objects containing various levels of hierarchies and entity relationships.
Advanced concepts – The following concepts are also addressed in the logical model: sub-typing (inheritance), one-to-many relationships (rather than 0-to-many), null meaning and operations, and row interdependence (chaining in history).
For example, attributes for product type, product category, product sub-category, and product are defined as well as relationships between these hierarchies. In transactional systems, data is normalized and definitions of primary and foreign keys as well as relevant relationships are included in the model. In analytical systems, data can be denormalized, and in this case, logical models will also include definitions of business keys on dimensions and composites for surrogate keys on facts.
Figure 2-40 is an example of a partial logical model for the larger denormalized model.
Figure 2-40: Partial logical data model
The physical model adds final detail to the modeling effort in respect to column data types, nullability, primary keys, indexes, statistics, and other relevant table properties. The diagram in Figure 2-41 expands on the logical model from above, introducing information related to the physical modeling level of detail.
Figure 2-41: Partial physical data model
In physical modeling, it is important to properly manage data definition language (DDL) statements for all tables, functions, stored procedures, and other relevant database objects.
When changes to database objects are properly managed, switching between versions of changes to metadata becomes more feasible. This is sometimes required in the normal course of development and introduction of new features.
Having DDL properly managed also provides for more encompassing disaster recovery procedures and enables recreating data warehouse structures in additional development, testing, or user acceptance testing (UAT) environments.
An effective method for managing DDL is to have scripts versioned and controlled via version control software such as Microsoft Team Foundation Server.
Figure 2-42 shows the DDL for the DimDate table used in the previous diagram.
Figure 2-42: DimDate DDL
Data Modeling – Column Types
So far, our focus has been at the table level. However, this section covers different column types defined and used during data modeling.
A surrogate key, shown in Figure 2-43, is a unique identifier for records within a table. Surrogate keys are either generated by the database when the record is inserted into a table or by a data integration application prior to the record insertion. SQL Server automatically generates the surrogate key when the column is created with the IDENTITY attribute. Note that PDW does not support the IDENTITY attribute in its initial release.
Surrogate keys are at the core of every data warehouse model and are typically integer columns that provide a pointer to a unique instance of a dimension member defined by its natural key. Surrogate keys are used to keep fact tables as narrow as possible, to create effective indexes on dimension tables, and to support Type 2 dimensions. In addition, surrogate keys replace the need for including a source system identifier (along with the natural key) for each table that merges data from multiple source systems.
Surrogate keys are typically designed as sequentially incrementing integers and have no logical correlation to a natural key (defined below). In special cases, such as in the Date dimension, the surrogate key is an integer representation of the canonical date value (e.g., May 1, 2009, has a surrogate key of 20090501). This type of surrogate key is also referred to as an intelligent key. Although intelligent keys are not generally recommended, they are acceptable in this case because the calendar is a stable entity not subject to change.
The SQL Server data type for a surrogate key is typically an int because this data type’s maximum value (2,147,483,647) is larger than most tables’ projected cardinality. However, use of tinyint and smallint for small dimensions, as well as use of bigint for very large dimensions, is relevant depending on the projected table cardinality over the life of the data warehouse. The SQL Server data type section below contains more information about integer data types, including their minimum and maximum values.
Data modeling also includes creating a record containing a default surrogate key used to represent null or not available records. Data integration processes will use the default surrogate key (e.g., 0) instead of a NULL value when a lookup doesn’t produce a match.
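For example, a minimal T-SQL sketch of a dimension whose surrogate key is generated with the IDENTITY attribute, seeded with a default member for failed lookups (the table and column names here are hypothetical):

```sql
-- Hypothetical dimension with an IDENTITY-generated surrogate key
CREATE TABLE dbo.DimProduct (
    ProductKey      int IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- surrogate key
    ProductLevel1Cd nvarchar(25)      NOT NULL,              -- natural (business) key
    ProductName     nvarchar(50)      NOT NULL
);

-- Seed the default member (surrogate key 0) used when a lookup finds no match
SET IDENTITY_INSERT dbo.DimProduct ON;
INSERT INTO dbo.DimProduct (ProductKey, ProductLevel1Cd, ProductName)
VALUES (0, N'N/A', N'Unknown');
SET IDENTITY_INSERT dbo.DimProduct OFF;

SET IDENTITY_INSERT dbo.DimProduct OFF;
```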
The natural key, also known as a business ID, is one or more attributes (columns) within a source system used to uniquely identify an entity. Figure 2-44 shows the natural key (ProductLevel1Cd) for an AdventureWorks product.
Figure 2-44: Natural key
Note that a natural key is used to uniquely identify one product, while the surrogate key is used to uniquely identify one instance of that product over time. Also note in the above example that the SourceSystemKey may also be part of the natural key when there’s the risk of duplicated product codes from different source systems.
The data warehouse data model needs to be flexible to account for changes in the uniqueness of dimension records as business rules change over time. However, the natural key is the one value that must never change over time; if it does change, the record is treated as a new item (e.g., a new product SKU).
An attribute (or column) adds descriptive characteristics to a table. Attributes exist for all types of tables within a data model. Figure 2-45 shows some attributes for the AdventureWorks DimVendor table.
Figure 2-45: Attributes for DimVendor table
Measures or Metrics
A measure, or metric, is a numeric column typically used to store transaction amounts, counts, quantities, and ratios. Figure 2-46 shows some measures from the AdventureWorksDW2008 FactSalesDetail table.
Measures are classified as either base or calculated. One or more base measures are used when resolving the value of a calculated measure. Decisions about which layer of a data warehouse architecture hosts measure calculations are defined during data modeling sessions. These decisions are determined based on whether calculations are relevant to all subject areas in a data warehouse or not.
Measures can also be categorized by how they are calculated:
Additive – These measures aggregate across all dimensions with no special provisions. Examples of additive measures are the OrderQuantity and SalesAmount measures found in the AdventureWorksDW2008 fact tables.
Semi-additive – These measures can be aggregated only over specific dimensions. Examples of semi-additive measures are account balances and inventory levels. These measures can be aggregated across some dimensions, but summing a customer's account balance snapshots over a year, for example, would produce a meaningless total.
Non-additive – Ratios and percentages are examples of non-additive measures. These measures can't be aggregated over any dimension.
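The distinction between additive and semi-additive measures can be sketched in T-SQL. The table and column names below are hypothetical, and the snapshot query assumes the integer yyyymmdd DateKey described earlier:

```sql
-- Additive: order quantities sum safely across every dimension
SELECT d.CalendarYear, SUM(f.OrderQuantity) AS TotalQuantity
FROM dbo.FactSales f
JOIN dbo.DimDate d ON d.DateKey = f.OrderDateKey
GROUP BY d.CalendarYear;

-- Semi-additive: balances must not be summed across time; take the
-- period-end snapshot per account, then sum across accounts
SELECT d.CalendarYear, SUM(f.Balance) AS YearEndBalance
FROM dbo.FactAccountBalance f
JOIN dbo.DimDate d ON d.DateKey = f.SnapshotDateKey
WHERE f.SnapshotDateKey IN (
    SELECT MAX(d2.DateKey)          -- last date key of each year
    FROM dbo.DimDate d2
    GROUP BY d2.CalendarYear
)
GROUP BY d.CalendarYear;
```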
Dates and Time
Dates play a central role within a data warehouse. Time does as well, but usually to a lesser degree. Date and time are special entities in a data warehouse and, as such, require particular attention in data modeling efforts. Dates and time are typically stored in the SQL Server datetime or smalldatetime data types and can represent:
Date and time of day (e.g., May 10, 2010, 10:30am EST)
Date (e.g., May 10, 2010)
Time within a day (e.g., 10:30am EST)
Besides including intelligent surrogate keys for these dimensions, as described in the section on surrogate keys above, organizations often use date and time for “to date” calculations.
Whether to keep date and time in a single dimension or separate them is one of the questions data modeling touches on. Separating these two dimensions provides for greater manageability and usability in a data warehouse. In this approach, two surrogate keys are derived from one source column. For example, a source column for a transaction date of ‘2010-01-01 14:01:00’ would yield a surrogate key for the Date dimension based on the ‘2010-01-01’ segment, while a surrogate key for the Time dimension would be derived from the ’14:01:00’ part of this source column.
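Assuming the intelligent yyyymmdd surrogate key described earlier and a Time key expressed as minutes since midnight (one possible design), the two keys can be derived from the single source column like this:

```sql
DECLARE @TransactionDate datetime = '2010-01-01T14:01:00';

SELECT
    -- Date surrogate key: integer yyyymmdd form of the date portion
    CONVERT(int, CONVERT(char(8), @TransactionDate, 112)) AS DateKey,  -- 20100101
    -- Time surrogate key: minutes since midnight (1,440 possible members)
    DATEPART(hour, @TransactionDate) * 60
        + DATEPART(minute, @TransactionDate)               AS TimeKey; -- 841
```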
Date and time are often used as role-playing dimensions, so one physical date dimension will logically be modeled to reference invoice date and transaction date, as well as create date and other dates in a data warehouse.
Also note that date dimensions differ across vertical industries—for example, a financial date dimension will differ from a retail date dimension.
Figure 2-47 shows an example of both a Date and a Time dimension.
Figure 2-47: Date and Time dimensions
Support columns are included in data models in support of tracking values over time. These columns allow for more efficient management of state changes for dimensions and for more effective auditing of data loads. Figure 2-48 shows some instrumentation support columns.
Figure 2-48: Support columns
The following columns are used to track different historical versions:
StartDate – The date when this record version became active.
EndDate – The date when this record became inactive. The current record will either have a NULL value or a value representing the maximum date (e.g., 12/31/9999).
RecordStatus – The status of this record. Typical values are Active, Inactive, and Pending.
VersionId – This value represents the version number of the record and is incremented every time a version of the record is created.
InferredMemberInd – This value indicates whether a dimension member was loaded during the fact load as an inferred member. Inferred members are dimension members for which no attributes other than the business ID are available at the time facts are loaded (a surrogate key lookup during the fact load yields no match).
The other support columns serve data integration instrumentation. LineageId is an example of this kind of column; it contains a value that lets data stewards track the process that loaded the record. Another example is InferredMemberInd, which, as we just saw, is a flag indicating whether a record is an inferred member.
All of these columns are covered in more detail in Chapter 3 – Data integration.
Data modelers also address other types of columns as they are relevant to business requirements. These columns include spatial data, XML, large text, and images.
Data models include definitions on keys as well as changes to keys as data changes custody in data warehouse layers. Typically, data models account for introducing new keys for each change in data custody. These keys are implemented by using either the natural key from the source system or the surrogate key created during the data integration process.
The blanket recommendation is for data warehouse data models to use surrogate keys for all primary/foreign key activity. The reasons for doing so include:
More efficient representation – Natural keys in source systems are often defined using character data types, which are less efficient than integer data types in SQL Server. Additionally, natural keys may be represented as different values in different source systems, making them challenging to consolidate. A surrogate key avoids both problems.
Tracking changes across history – The need to report across history requires the data model to store the instance of that record at a particular point in time.
Natural keys are often used in data warehouses to aggregate multiple versions of one entity across all instances of this entity (i.e., one natural key for more than one surrogate key).
Additional benefits of proper key management in data warehouses include:
Enabling the conceptual and enterprise model – Failing to properly identify natural keys or to efficiently manage lookups of natural keys for surrogates negates the basic principles of data modeling. Key management is one of the essential factors for enabling accurate and effective conceptual and logical models.
Preparing for the future – Managing natural and surrogate keys prepares the data warehouse for downstream data consumption. OLAP structures and BI semantic layer components perform more efficiently and provide for greater data analysis when data models include proper handling of keys.
Once a decision has been made to use surrogate keys, the next question becomes:
How do we generate surrogate keys?
The short answer is that the SQL Server data modeler and developer can use globally unique identifiers (GUIDs), IDENTITY columns, or a key generation utility. GUIDs usually are not used due to their size (16 bytes) and their randomly generated nature. For more information about surrogate key generation, see the Surrogate Key section in Chapter 3 – Data Integration.
Dimension is a term most commonly associated with Ralph Kimball. In his white paper Facts and Fables about Dimensional Modeling, Kimball attributes the first reference of facts and dimensions to a joint research project conducted by General Mills and Dartmouth University in the 1960s.
Wikipedia states that:
Dimensional modeling always uses the concepts of facts (measures), and dimensions (context). Facts are typically (but not always) numeric values that can be aggregated, and dimensions are groups of hierarchies and descriptors that define the facts.
Dimensions are often modeled based on multiple entities from one or more source systems. Depending on reporting needs, consumers of a data warehouse system will access the underlying entities for operational reporting, while analytical reporting will rely on dimensions.
Traditionally, flattened, denormalized structures are the most efficient data modeling technique because they require the fewest joins to produce the requested result. However, in a few cases, a dimension can become too wide, and data modelers need to opt for a more normalized structure.
For more information about dimensions, a good starting point is Kimball’s articles at his Web site.
Let’s spend the rest of this section getting an overview of the most common dimension types and concepts:
Star dimension – A star dimension is a fully denormalized version of a dimension, where all hierarchies and reference values are merged into the dimension. Star dimension tables are directly linked to the fact table.
Snowflake dimension – Snowflake dimensions contain more than one dimensional table to include references to all relevant attributes. The AdventureWorks Product Category hierarchy, shown in Figure 2-49, is an example of a snowflake dimension. In a snowflake dimension, some of the tables may be indirectly linked to the fact table.
Figure 2-49: Snowflake schema example
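Resolving a snowflaked hierarchy requires a chain of joins back to the fact table. A sketch against the AdventureWorksDW2008 sample tables:

```sql
-- Each level of the snowflaked Product Category hierarchy is its own table,
-- so reaching the category requires traversing the chain of foreign keys
SELECT c.EnglishProductCategoryName,
       s.EnglishProductSubcategoryName,
       p.EnglishProductName,
       SUM(f.SalesAmount) AS SalesAmount
FROM dbo.FactInternetSales f
JOIN dbo.DimProduct p            ON p.ProductKey = f.ProductKey
JOIN dbo.DimProductSubcategory s ON s.ProductSubcategoryKey = p.ProductSubcategoryKey
JOIN dbo.DimProductCategory c    ON c.ProductCategoryKey = s.ProductCategoryKey
GROUP BY c.EnglishProductCategoryName,
         s.EnglishProductSubcategoryName,
         p.EnglishProductName;
```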
Parent-child dimension – This dimension type is used to model hierarchies such as an organizational chart or a chart of accounts. These hierarchies are unbalanced and ragged, which makes them difficult to model using a snowflake technique. The AdventureWorksDW2008 DimEmployee table, shown in Figure 2-50, is an example of a parent-child hierarchy.
Figure 2-50: Parent-child dimension example
Role-playing dimension – A role-playing dimension occurs when one dimension is linked to a single table multiple times. The most common example of this kind of dimension is Date, where one dimension is used to provide information on order date, invoice date, create date, and other dates, as shown in Figure 2-51.
Figure 2-51: Role-playing dimensions
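In SQL, a role-playing dimension is implemented by joining the same physical table several times under different aliases. A sketch against the AdventureWorksDW2008 sample:

```sql
-- One physical DimDate plays three roles via aliased joins
SELECT od.FullDateAlternateKey AS OrderDate,
       dd.FullDateAlternateKey AS DueDate,
       sd.FullDateAlternateKey AS ShipDate,
       f.SalesAmount
FROM dbo.FactInternetSales f
JOIN dbo.DimDate od ON od.DateKey = f.OrderDateKey
JOIN dbo.DimDate dd ON dd.DateKey = f.DueDateKey
JOIN dbo.DimDate sd ON sd.DateKey = f.ShipDateKey;
```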
Junk dimension – Junk dimensions represent a collection of low-cardinality, non-related attributes contained within one dimension, where each possible combination of attributes is represented with a single surrogate key. This design decision is purely for optimization—one 2-byte junk dimension key is more efficient than five 1-byte keys, which results in significant savings in storage space for fact tables with billions of rows or more.
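Because every attribute combination maps to one key, a junk dimension can be populated up front with the cross product of its attribute values. A hypothetical sketch:

```sql
-- Hypothetical junk dimension collecting unrelated low-cardinality flags
CREATE TABLE dbo.DimOrderFlags (
    OrderFlagsKey smallint IDENTITY(1,1) NOT NULL PRIMARY KEY,
    IsGift        bit     NOT NULL,
    IsExpedited   bit     NOT NULL,
    PaymentType   char(1) NOT NULL   -- 'C'ash, 'D'ebit, 'X' = credit
);

-- Populate every possible combination: 2 x 2 x 3 = 12 rows
INSERT INTO dbo.DimOrderFlags (IsGift, IsExpedited, PaymentType)
SELECT g.v, e.v, p.v
FROM (VALUES (0), (1)) g(v)
CROSS JOIN (VALUES (0), (1)) e(v)
CROSS JOIN (VALUES ('C'), ('D'), ('X')) p(v);
```

The fact table then carries one small OrderFlagsKey column instead of three separate attribute columns.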
Degenerate dimension – A degenerate dimension does not have its own table; it is represented by its value within the fact table. This typically occurs with transaction identifiers such as order number and invoice number, so degenerate dimensions capture the transaction number or natural primary key of the fact table. Because it does not make sense to create a dimension with no attributes, the value is instead stored directly in the fact table.
Figure 2-52 shows two examples of degenerate dimensions within the AdventureWorksDW2008 FactInternetSales table.
Figure 2-52: Degenerate dimensions
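Because the order number lives directly on the fact table, no dimension join is needed to group or filter by it. Against the AdventureWorksDW2008 sample:

```sql
-- SalesOrderNumber is a degenerate dimension stored on the fact table itself
SELECT f.SalesOrderNumber,
       SUM(f.SalesAmount) AS OrderTotal,
       COUNT(*)           AS LineCount
FROM dbo.FactInternetSales f
GROUP BY f.SalesOrderNumber;
```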
Accurately reporting historical results often requires that the state of a dimension at a particular point in time be recorded and saved, as opposed to being overwritten. There are basically three methods for tracking history:
Update one record – There’s one version of a record, and all changes are applied to this one record. This approach is often referred to as a Type I Slowly Changing Dimension (or SCD I).
Track record changes – Every change in a record will result in a new version of that record with a unique surrogate key. This is often referred to as a Type II SCD (or SCD II).
Add new columns – Every change in a tracked value results in a new column being added to the table. This is often referred to as an SCD III.
It’s often not practical to add columns, so the options come down to two data integration patterns: versioned inserts and versioned updates. Each of these is covered in more detail in Chapter 3 – Data Integration.
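A versioned insert (Type II change) can be sketched as an expire-and-insert pair. The table, key values, and support columns below are hypothetical and assume the StartDate/EndDate/RecordStatus columns described earlier:

```sql
-- Hypothetical Type II change: expire the current version, insert a new one
DECLARE @Today date = CAST(GETDATE() AS date);

UPDATE dbo.DimProduct
SET EndDate = @Today,
    RecordStatus = 'Inactive'
WHERE ProductLevel1Cd = N'BK-1234'   -- natural key of the changed member
  AND RecordStatus = 'Active';

INSERT INTO dbo.DimProduct
    (ProductLevel1Cd, ProductName, StartDate, EndDate, RecordStatus)
VALUES
    (N'BK-1234', N'Touring Bike (renamed)', @Today, '9999-12-31', 'Active');
-- The insert generates a new surrogate key; facts loaded from now on
-- reference it, while historical facts keep the prior version's key
```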
Now that dimensions have been covered, the next topic discusses fact tables, which typically model a business transaction or business process.
The largest tables within source systems are the transaction tables. These tables are often orders of magnitude larger than dimensions. These tables are modeled as fact tables within a dimensional data model. Another common source for fact tables is business processes.
Fact tables are optimized structures that are typically comprised of numbers, dates, and very small character columns. The numbers are divided into numeric and monetary values and foreign key references to dimensions. Classification of facts can be done by fact table types and categories.
Fact table categories were introduced in the Column Types section above and include additive, semi-additive, and non-additive measures. Another category is a custom rollup fact table, where the aggregation rules are specific to a dimensional value or values. The most common example of this is a financial chart of accounts.
Fact table types include:
Transactional – This is the most common type of fact table. The classic example is entering one product sale at a store. Transaction facts are usually additive—that is, SQL aggregate functions such as SUM, MIN, MAX, and COUNT can be applied to the measures.
Snapshot – The facts within a snapshot fact table are not additive across one or more dimensions, typically the Date dimension. The classic example is inventory, where the measures represent values at a point in time. Inventory is not additive across time, but is additive across other dimensions referenced by the snapshot fact table. Other examples include event booking levels and chart of account balance levels.
Factless – This table has no measured facts; row counts are the only measure. It is typically used to record events (something that has or has not happened) or many-to-many relationships such as coverage models. It contains only dimension keys, or a single fact with a value of 0 or 1. Common examples include class attendance, event tracking, coverage tables, and promotion or campaign facts. Ralph Kimball used voyages as the example in his seminal book, The Data Warehouse Toolkit.
Aggregate – Aggregate fact tables include information rolled up at a certain hierarchy level. These tables are typically created as a performance optimization in support of a business reporting requirement. Deriving an aggregation table once in a data integration process is much more efficient than aggregating every time the information is requested. Aggregations, whether created on the fly or pre-computed, need to account for the additive nature of the measures.
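Pre-aggregating an additive measure to the month level might look like the following (assuming AdventureWorksDW2008-style names; the target table name is hypothetical):

```sql
-- Roll the transactional fact table up to year/month/product grain once,
-- during data integration, rather than at every query
SELECT d.CalendarYear,
       d.MonthNumberOfYear,
       f.ProductKey,
       SUM(f.SalesAmount)   AS SalesAmount,
       SUM(f.OrderQuantity) AS OrderQuantity
INTO dbo.FactInternetSalesMonthly
FROM dbo.FactInternetSales f
JOIN dbo.DimDate d ON d.DateKey = f.OrderDateKey
GROUP BY d.CalendarYear, d.MonthNumberOfYear, f.ProductKey;
```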
Reference data was described earlier in this chapter as a simple classification of a noun or transaction. These classifications play a key role in data warehouse reporting and analytics by providing the ability to filter and/or aggregate results. Reference data is considered a category of master data.
The DimScenario reference table, shown in Table 2-1, is one example of reference data within the AdventureWorks samples.
This table is a common reference table found in budgeting and forecasting software and allows the filtering of fact tables. Note that the above example isn’t truly representative because many reference data tables consist of a character code and description. For example, a State reference table would have state code values and description columns, as shown in Table 2-2.
Table 2-2: Example State reference table
In the above example, the data modeler decides whether to add a surrogate key, as shown in Table 2-3, or to use the natural key.
Table 2-3: Example State reference table with surrogate key
The general data modeling rule is that adding a surrogate key rarely has a negative impact. The more relevant question is whether to use the surrogate key (StateKey) or the natural key (StateCode) as the foreign key. The general rule of thumb is to always use the surrogate key, but the downside is that:
There will always be a need to join to the reference table.
With versioned inserts, foreign keys reference a specific instance (version) of a reference table entry rather than the entry itself.
Data modelers often include the natural key and description in a denormalized table when modeling hierarchies, as shown in Table 2-4.
Table 2-4: Denormalized City table
Merging Reference Data
The data warehouse team will often need to merge multiple reference data tables into one master reference table. This is required when there are multiple source systems, each with their own set of reference tables and codes. In addition, organization-specific reference data frequently needs to be merged with industry-standard data because the standard either does not fully meet the organization’s needs or the reference data existed prior to the standard. This is a common occurrence in the health care industry, where codes used in subject areas such as clinical analysis are organization-specific.
Creating a single version of the truth for reference tables and hierarchies is a significant portion of MDM efforts. Merging different clinical codes may require the creation of a new natural key, which is represented by two or more different natural keys in their respective source systems.
Note that using the surrogate key as the foreign key could result in an issue if the reference table supports versioned inserts. What seems like a very simple issue can turn into something much more complex.
Although adding a surrogate key doesn’t hurt, determining whether to use the surrogate key or the natural key for foreign key references is a function of:
Whether the natural key will ever change
Whether the objective is to reference a versioned instance of the record or the record regardless of version
This decision becomes more important when modeling hierarchies, which are often comprised of multiple levels, each with their own reference table.
A hierarchy is a multi-level structure in which each member is at one level within the hierarchy. It is a logical structure of dimension attributes that uses ordered levels to define data aggregations and end-user navigational drill paths.
Each hierarchy member has zero or one parent and zero or more children. Figure 2-53 shows an example of a date hierarchy.
Figure 2-53: Date hierarchy
The leaf level is the lowest level in the hierarchy, and each member at this level is connected to one parent. This structure is repeated for each level above the leaf level until the top level is reached.
Multiple hierarchies can be created from one source. Figure 2-54 shows an example of this: The AdventureWorksDW2008 DimDate table is displayed on the left, and the multiple hierarchies that can be created from this one table are displayed on the right. (Note that this SSAS example was used to visually show multiple hierarchies; SSAS is not required when creating multiple hierarchies within a data warehouse.)
Figure 2-54: DimDate table hierarchies
Hierarchies are a large topic, and this chapter covers balanced, ragged, unbalanced, and network hierarchies.
Figure 2-54 above is an example of a balanced hierarchy. All branches of a balanced hierarchy descend to the same level, with each member’s parent being at the level immediately above the member. Balanced hierarchies can be collapsed in one table, as in the above example, or exist in multiple tables, as with the Product Category hierarchy shown in Figure 2-49. In a standard balanced hierarchy, each level of the hierarchy is stored in one and only one column of the dimension table.
In a ragged hierarchy, the parent of a member can come from any level above the level of the member, not just from the level immediately above. This type of hierarchy can also be referred to as ragged-balanced because levels still exist. A Geography dimension is a typical example of such a hierarchy. In some countries, the province/state level may not exist, as with the Republic of Singapore or Vatican City State, for example.
Unbalanced hierarchies have consistent parent-child relationships but logically inconsistent levels, and the hierarchy branches can have inconsistent depths. An example of an unbalanced hierarchy is an organization chart, which shows reporting relationships among employees in an organization. The levels within the organizational structure are unbalanced, with some branches in the hierarchy having more levels than others. The AdventureWorksDW2008 DimEmployee dimension table is an example of an unbalanced hierarchy.
Hierarchies and History
Hierarchies change over time, which requires the data warehouse to display the correct hierarchy based on the particular date range or period in time. One example is reporting on sales based on the Sales Territory hierarchy. Figure 2-55 shows the Northeast sales territory in 2009 and 2010. This hierarchy has three levels:
Division (Northeast)
Region (Mid-Atlantic, New England)
District (e.g., New York, Connecticut)
Figure 2-55: Northeast sales territory
As you can see, sales management changed the sales territories in 2010 and subdivided New York and Connecticut as follows:
Created a new district, Upstate NY, and assigned it to the New England region. The remainder of New York stays within the New York district.
Divided Connecticut into two districts:
North of Hartford is now part of the New England region.
Hartford and South is part of the Mid-Atlantic region.
However, the business consumer may still want to see 2009 Northeast sales totals based on:
How the sales territory was organized in 2009
The current sales territory structure (i.e., the 2010 sales territory)
Modeling a Versioned Balanced Hierarchy
A hierarchy can be modeled in several ways, depending on whether the hierarchy is balanced or not. When the hierarchy is balanced, it can be represented by multiple reference tables, where each reference table is used to populate one level in the hierarchy.
This can be modeled either by each level containing a foreign key reference to its parent or in a flattened structure, as shown in Table 2-5. This table shows how the changed sales territory information presented above is stored within a denormalized Sales Territory table.
Note that this table could also be modeled in a normalized structure, using multiple reference tables with foreign key references to their parent. However, each of the reference tables would need to be versioned to account for the sales territory versioned changes over time.
Table 2-5: Sales territory structure
Note the following about the Sales Territory table:
Sales Division, Region, and District all are separate reference tables. Their natural key is used to model the sales territory.
Using the natural key, as opposed to the surrogate key, avoids an issue of returning different results from reference tables that support versioned inserts.
The version of the hierarchy is a key attribute in this example. Many hierarchical structures such as a financial chart of accounts or a sales territory are created as one instance or version, as opposed to treating each separate change as its own version.
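Pulling these notes together, a denormalized, versioned Sales Territory table along the lines of Table 2-5 might be declared as follows. The column names and data types are illustrative assumptions, not the actual schema from the table.

```sql
-- Illustrative sketch of a flattened, versioned Sales Territory table.
-- Division, Region, and District are stored by natural key (not surrogate
-- key), and HierarchyVersion identifies the 2009 vs. 2010 structure.
CREATE TABLE dbo.SalesTerritory (
    SalesTerritorySK  int IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- surrogate key
    DivisionCode      varchar(10) NOT NULL,  -- natural key of Sales Division
    RegionCode        varchar(10) NOT NULL,  -- natural key of Region
    DistrictCode      varchar(10) NOT NULL,  -- natural key of District
    HierarchyVersion  smallint    NOT NULL   -- e.g., 1 = 2009, 2 = 2010
);
```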
The next consideration is how dimension members that are children of the Sales Territory table (such as Sales Rep and Company of both Suppliers and Customers) should be modeled. Figure 2-56 shows the options available to data modelers.
Figure 2-56: Hierarchy modeling options
Data modeling options include:
Store the surrogate key of the Sales Territory instance.
Store the natural key of the sales territory.
Denormalize the Sales Territory into the Company table.
Add to the hierarchy a Company level that contains the Company surrogate and natural keys. Every version of the hierarchy will include all companies.
Options #1 and #3 would both require a different version for every Sales Rep record each time the sales territory changed. Option #2 would not require an extra version of the Sales Rep record, but would require the query to have a version filter to return the desired version of the sales territory. Option #4 would require the creation of one Sales Rep record for every version of the sales territory.
Modeling a Time-Dependent Balanced Hierarchy
Other hierarchies change based on events as opposed to a versioned release. One example of this scenario is a company linking structure that links subsidiaries and legal entities to their parent company. Figure 2-57 shows a simple example with two parent companies each consisting of two companies. The time-dependent event would be Holding Company B’s acquisition of Company A.
Figure 2-57: Company linking example
Table 2-6 shows how this two-level hierarchy could be modeled.
Table 2-6: Modeling a company linking hierarchy
Notice that in this example, the changes are in response to an independent event instead of a scheduled version release. Any query that aggregates or filters on the parent company could return different results when you apply an “as of” date to the Start and End date ranges.
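A point-in-time query against such a company linking table might look like the following hedged sketch; the table and column names are hypothetical stand-ins for the structure shown in Table 2-6.

```sql
-- Returns the company linking structure as of a given date. A row is
-- current for that date when the date falls within its Start/End range.
DECLARE @AsOfDate date = '2010-06-30';  -- hypothetical "as of" date

SELECT ParentCompanyID, CompanyID
FROM dbo.CompanyLink
WHERE @AsOfDate BETWEEN StartDate AND EndDate;
```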
Ragged and Unbalanced Hierarchies
As discussed earlier, this class of hierarchy will have different levels and will be unbalanced. Organizational charts and a financial chart of accounts are typical examples of ragged, unbalanced hierarchies. These hierarchies are often modeled as parent-child structures. The DimEmployee dimension illustrated in Figure 2-43 above is a good example of this.
Although the parent-child structure makes it easy to store these values, reporting becomes more difficult because the table is not level-centric—meaning there isn’t one table for each level in the hierarchy. When reporting, the hierarchy levels become more important because aggregation is frequently based on a level number.
The options in this case are:
Create a balanced hierarchy
Transform to a balanced hierarchy within SQL Server
Creating a balanced hierarchy from a ragged, unbalanced hierarchy is typically done by following these steps:
Create a balanced hierarchy table, with the code and description columns for the maximum level within the ragged hierarchy.
Populate this balanced hierarchy table starting with the parent and descending down through the hierarchy.
Repeat parent values in the child level if the number of levels within a branch is less than the maximum number of levels.
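The steps above can be sketched in T-SQL as follows, assuming a hypothetical three-level ragged source stored as one reference table per level. COALESCE repeats the parent's values when a branch is shallower than the maximum depth (step 3); all table and column names are illustrative assumptions.

```sql
-- Step 1: the balanced target table, one code/name pair per level.
CREATE TABLE dbo.BalancedOrg (
    Level1Code varchar(10) NOT NULL, Level1Name varchar(50) NOT NULL,
    Level2Code varchar(10) NOT NULL, Level2Name varchar(50) NOT NULL,
    Level3Code varchar(10) NOT NULL, Level3Name varchar(50) NOT NULL
);

-- Steps 2 and 3: descend from the parent level, repeating parent values
-- wherever a lower level is missing so every branch reaches level 3.
INSERT INTO dbo.BalancedOrg
SELECT
    l1.Code, l1.Name,
    COALESCE(l2.Code, l1.Code), COALESCE(l2.Name, l1.Name),
    COALESCE(l3.Code, l2.Code, l1.Code), COALESCE(l3.Name, l2.Name, l1.Name)
FROM ragged.Level1 AS l1
LEFT JOIN ragged.Level2 AS l2 ON l2.ParentCode = l1.Code
LEFT JOIN ragged.Level3 AS l3 ON l3.ParentCode = l2.Code;
```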
Using Common Table Expressions
In SQL Server, Common Table Expressions (CTEs) are a powerful tool for querying hierarchies. For example, Figure 2-58 shows a CTE for querying the HumanResources.Employee hierarchy in the AdventureWorks sample database.
Figure 2-58: Common Table Expression example
Results of this CTE include manager and employee IDs, relevant titles, and levels within the organization, as Figure 2-59 illustrates.
Figure 2-59: Common Table Expression results
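The query in Figure 2-58 follows the standard recursive CTE pattern. A hedged reconstruction against the 2008-era AdventureWorks schema (where HumanResources.Employee exposes EmployeeID and ManagerID) looks like this:

```sql
-- Recursive CTE over the employee parent-child hierarchy. The anchor
-- member selects the top of the organization (no manager); the recursive
-- member joins each employee to the previous level's results.
WITH EmployeeHierarchy (ManagerID, EmployeeID, Title, Level) AS
(
    SELECT e.ManagerID, e.EmployeeID, e.Title, 0 AS Level
    FROM HumanResources.Employee AS e
    WHERE e.ManagerID IS NULL            -- anchor: top of the organization

    UNION ALL

    SELECT e.ManagerID, e.EmployeeID, e.Title, eh.Level + 1
    FROM HumanResources.Employee AS e
    INNER JOIN EmployeeHierarchy AS eh
        ON e.ManagerID = eh.EmployeeID   -- recursive step: direct reports
)
SELECT ManagerID, EmployeeID, Title, Level
FROM EmployeeHierarchy
ORDER BY Level, ManagerID;
```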
For more information about SQL Server CTEs, see WITH common_table_expression (Transact-SQL). Note that CTEs are not supported in the initial release of PDW.
Also note that in a network hierarchy, nodes can have more than one parent. A common example of a network hierarchy is a family tree.
Which Hierarchy-Handling Option Is Most Effective?
In data warehouse architecture, the most effective design option to handle hierarchies has proven to be flattening hierarchies into a single table. Also, if there is more than one hierarchy defined for a dimension, all hierarchies should be included in the one table. This approach eliminates joins between the main dimension table and lookup tables, improving data retrieval, which is what data warehouse systems are built for.
Finally, in parent-child hierarchies, for smaller dimensions, you can use a typical ID-ParentID recursive approach. But for larger dimensions, this technique can have significant performance issues. A recommended strategy in these cases is to introduce many-to-many fact(less) tables. This approach works well with extending data models to OLAP, especially with SSAS.
Bridge Tables
Bridge tables hold values for multiple instances of relationships between entities. These containers for storing many-to-many relationships are also referred to as junction or cross-reference tables.
One of the examples for using bridge tables is in handling security privileges for users in a data warehouse. Figure 2-60 depicts this scenario, with the UserSecurity table serving as a bridge table.
Figure 2-60: Bridge table example
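A minimal sketch of the UserSecurity bridge table in Figure 2-60 follows; the key names are hypothetical.

```sql
-- Bridge (junction) table resolving the many-to-many relationship between
-- users and the dimension members they are allowed to see. The composite
-- primary key prevents duplicate user/member pairs.
CREATE TABLE dbo.UserSecurity (
    UserSK   int NOT NULL,  -- references the User dimension
    MemberSK int NOT NULL,  -- references the secured dimension member
    CONSTRAINT PK_UserSecurity PRIMARY KEY (UserSK, MemberSK)
);
```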
Nulls and Missing Values
This section outlines guidelines for handling nulls and missing values when architecting effective data warehouse systems.
Nulls are not recommended for attribute values because they provide no meaning for analysis. The existence of nulls also translates into more complex queries (i.e., the existence of nulls must be checked in addition to comparing values). In addition, with respect to business IDs, nulls need to be handled before they reach data warehouse tables.
Business rules implemented in data integration processes need to include handlers for nulls and process them based on what type of attributes they are assigned to. Typical rules for handling null values for dimension attributes include replacing code values with ‘N/A’ and replacing descriptive values with ‘Unknown’. Of course, data integration processing of null values must be consistent across the data warehouse; otherwise, data will be subjected to various interpretations.
Finally, if business ID columns include nulls, depending on relevant business rules, corresponding records can be prevented from loading to the data warehouse by loading them into exception tables. Alternatively, they can be loaded under an Unknown dimension member.
For measures, the decision to persist null values is sometimes a function of the downstream consumers. For example, OLAP engines such as SSAS that support sparse cubes typically perform much better when null values are used for fact table measures used to populate the OLAP database.
Missing Values (Empty Strings)
Empty strings for attributes usually don’t translate into an asset for a data warehouse. Preserving empty strings can lead to inconsistent reporting and confusion for information workers if an empty string in a source system is equivalent to a null value. Business consumers should have input into whether null values and empty strings can both be replaced with the same default. If so, data integration business rules can replace missing values and null values with the same default value.
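In data integration code, these rules often reduce to a single expression per column. The following hedged sketch, with hypothetical staging table and column names, folds trimmed empty strings into nulls and then applies the agreed defaults:

```sql
-- NULLIF turns trimmed empty strings into NULL; COALESCE then substitutes
-- the default agreed with business consumers ('N/A' for code values,
-- 'Unknown' for descriptive values).
SELECT
    COALESCE(NULLIF(LTRIM(RTRIM(s.ProductCode)), ''), 'N/A')     AS ProductCode,
    COALESCE(NULLIF(LTRIM(RTRIM(s.ProductName)), ''), 'Unknown') AS ProductName
FROM Staging.Product AS s;
```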
Referential Integrity
Referential integrity defines relationships between entities as stipulated by business rules in the logical data warehouse model. Enforcing referential integrity is a core element of data quality. Traditionally, the best practice in data modeling for preserving referential integrity was to create a foreign key constraint.
However, this best practice can cause performance issues when applied to large tables with a significant number of foreign keys. The reason is that SQL Server enforces referential integrity by first verifying that a record exists in the referenced table prior to the insert. This is a consideration for most SQL Server data warehouses because many fact tables have a lot of foreign keys and very large table sizes.
The alternative is to enforce referential integrity within the data integration processes rather than creating foreign key constraints. This is a best practice in large SQL Server data warehouse implementations and applies to both the staging and data warehouse data stores.
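One common way to enforce referential integrity within the data integration process is to route orphaned rows to an exception table before the fact load. The table and column names below are hypothetical:

```sql
-- Rows whose DateKey has no match in DimDate are captured in an exception
-- table instead of being loaded into the fact table, replacing the check
-- a foreign key constraint would otherwise perform on every insert.
INSERT INTO etl.FactSales_Exceptions (DateKey, ProductKey, SalesAmount)
SELECT s.DateKey, s.ProductKey, s.SalesAmount
FROM Staging.Sales AS s
LEFT JOIN dbo.DimDate AS d
    ON s.DateKey = d.DateKey
WHERE d.DateKey IS NULL;   -- no matching dimension member
```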
Clustered vs. Heap
One of the important physical design decisions for SQL Server tables is whether a table should be modeled with a clustered index or as a heap. SQL Server clustered index tables have one index that stores and sorts the data rows based on the index key values. Tables without this feature are referred to as heap tables. Figure 2-61, from SQL Server 2008 R2 Books Online, illustrates the difference between heaps and clustered tables at a physical level.
Figure 2-61: Heap vs. clustered table on disk structure
The leaf level of a clustered table index is the data page, which is organized by the clustered index key. Heap tables are organized by their insert order—each insert appends the record to the last data page.
Table 2-7 compares clustered and heap tables (with no additional indexes) across storage and access characteristics.
Heap tables:
Data is stored in insert order
Inserts occur at the end of the table
Updates can result in page splits
Less need for defragmentation
Clustered tables:
Data is stored ordered by the clustered key
Requests by the clustered key are satisfied by an index seek
Inserts occur in clustered key order and can result in fragmentation
Updates can result in page splits
Additional overhead, in both disk space and time, to manage the index
Defragmentation required due to clustered key inserts
Table 2-7: Heap vs. clustered table comparison
The general rule for SQL Server data modelers is that clustered tables provide better performance when the clustered key(s) is commonly used for query operations. This is typically the case for dimensional data models.
The MSDN SQL Server best practices article Clustered Indexes and Heaps recommends that a clustered index always be created for a table. However, data modelers should recognize that creating a clustered index on every table will result in additional load times for very large fact tables. That additional load time may be unacceptable if it results in the data integration processes’ execution times overlapping with the time the data warehouse is available to business consumers.
Heap tables with no indexes are necessary in VLDW scenarios, where the design is optimized for loading: data is inserted into heap tables, and updates are never applied for performance reasons, because updates are expensive for very large tables.
Heap tables are a consideration when large tables are not directly accessed by business consumers, such as with the data warehouse database within a hub-and-spoke architecture. In this scenario, the fact tables within the data mart spokes should be modeled as clustered unless VLDW performance considerations prohibit this.
Here are some implementation notes to keep in mind about clustered indexes:
For clustered tables, the data integration process should sort the source data by the clustered index key, which in most cases will translate into records being inserted at the end of the table to reduce fragmentation.
These indexes may need to be maintained on a regular basis because clustered indexes can suffer from fragmentation depending on the extent of insert and update activity. Fragmentation in clustered tables can easily be resolved by rebuilding or reorganizing the clustered index. However, this can be a very lengthy operation, especially for fact tables with cardinality in the billions of rows.
It’s recommended that the clustered index column be an integer value. For dimensional data models, it’s common for the Date dimension’s primary key to be a 4-byte integer (YYYYMMDD). For example, the date Aug 4, 2010, is converted to 20100804. The clustered index key for the fact table is also a date represented in this format.
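The YYYYMMDD conversion mentioned above can be performed with CONVERT style 112 (ISO yyyymmdd):

```sql
-- Style 112 renders a date as 'yyyymmdd'; converting that string to int
-- yields the 4-byte clustered key value, e.g., Aug 4, 2010 -> 20100804.
SELECT CONVERT(int, CONVERT(char(8), CAST('2010-08-04' AS date), 112)) AS DateKey;
```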
Chapter 5 includes details about performance and query plans when accessing clustered vs. heap tables.
SQL Server Data Type Considerations
As a general rule, data modelers should choose the smallest data types when possible to reduce the total size of a record. This in turn reduces the amount of storage, disk I/O, and network bandwidth required to load and retrieve the data. This section reviews the data type options available to data modelers.
Character Data Types
The general rules of thumb for choosing character data types are:
Only use nchar and nvarchar when the universe of values spans or will span multiple languages. This is because the nchar and nvarchar data types are twice as large (2 bytes vs. 1 byte per character) as their char and varchar equivalents.
Use varchar or nvarchar for columns with descriptive values, names, or addresses. These varying data types are represented by a length and value, which make them slower to locate within a data page but smaller to store when there’s any variability in the length of the values.
Source system char data types are often space-padded, so a name column defined as char(8) would represent the name “Smith” as “Smith   ” (padded with three trailing spaces).
Use char for columns containing coded values. SQL Server char columns are more efficient because they are stored on disk in fixed locations within a record. Columns containing codes and abbreviations, such as StateCode, should be a char(2) not varchar(2).
For more information about SQL Server character data types, see the following links:
char and varchar
nchar and nvarchar (Transact-SQL)
Integer Data Types
Exact integers are the most common data type found in data warehouses and are used for measures, counts, and surrogate keys. The SQL Server data modeler can choose to model exact integers as:
1 byte (tinyint): 0 to 255
2 bytes (smallint): -32,768 to 32,767
4 bytes (int): -2^31 (-2,147,483,648) to 2^31-1 (2,147,483,647)
8 bytes (bigint): -2^63 (-9,223,372,036,854,775,808) to 2^63-1 (9,223,372,036,854,775,807)
Keep these notes in mind about SQL Server integer data types:
There are no unsigned smallint, int, or bigint data types.
Date dimensions are an example where smallint surrogate keys can be used, provided the date range spans less than about 89 years (i.e., 32,767 / 365 days per year).
Int is the most common data type used in SQL Server data warehouse models.
Bigint should only be used when the maximum value has a chance of exceeding 2.14 billion, the maximum value for an int.
GUIDs are 16-byte integers and are a convenient way to generate unique surrogate key values. However, they are not recommended if other options such as IDENTITY columns or a key generator are available. This recommendation is discussed in more detail in Chapter 3 – Data Integration.
For more about SQL Server integer data types, see int, bigint, smallint, and tinyint (Transact-SQL).
Numeric Data Types
SQL Server includes decimal, numeric, real, float, money, and smallmoney data types to store numbers. For data warehousing implementations, the use of the precise data types (decimal, numeric, money, and smallmoney) is recommended. The money data type can be used instead of decimal as long as requirements don’t call for more than four decimal digits.
The use of approximate data types such as real and float is not recommended because they store approximate values and, as such, can lose precision when aggregated.
If source data is stored in the real data type, there can be a slight loss of data precision when converting to decimal, but decimal data types are more accurate for querying in WHERE conditions and are typically more compatible with applications consuming data from a data warehouse.
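The precision loss is easy to demonstrate. In this hedged sketch, repeatedly adding one-tenth drifts for float but stays exact for decimal:

```sql
DECLARE @f float = 0, @d decimal(10, 2) = 0, @i int = 0;

WHILE @i < 100000
BEGIN
    SET @f = @f + 0.1;   -- accumulates binary rounding error
    SET @d = @d + 0.1;   -- exact decimal arithmetic
    SET @i = @i + 1;
END;

-- @d is exactly 10000.00; @f is typically slightly off from 10000.
SELECT @f AS FloatSum, @d AS DecimalSum;
```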
See Using decimal, float, and real Data for more information about SQL Server real data types.
Date Data Types
Date and time values are relevant to data warehouse implementations from the perspective of designing calendar dimensions. Depending on the grain of the data in a data warehouse, modelers can decide to have a single calendar dimension or one dimension for date values and another for time values.
The datetime2 data type should be considered for columns that store full date and time values, including hours, minutes, seconds, and fractional seconds. The date data type should be used when the data contains only year, month, and day values, due to the storage savings.
If a dimension holding values for time is modeled, you should use the time data type.
In case of two calendar-type dimensions, fact tables will have both surrogate keys, providing for more efficient data analysis.
You can find more information about SQL Server date data types at datetime (Transact-SQL).
Other Data Types
SQL Server supports many other data types. For a complete list, see Data Types (Transact-SQL).
Data warehouses, however, mostly use the data types we covered in this section, predominantly because data warehouses are architected for the most efficient data analysis.
Note that the emergence of XML as a standard for data integration shouldn’t translate into XML being used as a data type for data warehouses. Data modelers should review the XML data and decide whether to shred it into a table structure more conducive to efficient querying.
Very Large Data Sets
When you are architecting data warehouses with very large data sets, you can run into potential data consumption and data management difficulties.
While there are a number of considerations to keep in mind when architecting for VLDBs, the major points related to data warehouse architecture revolve around removing contention between read and write operations and maintaining the effectiveness of data retrieval and indexes.
Read and Write Contention
Organizing data and log files and tempdb onto separate disks is the focus of the effort to remove conflicts between read and write processes. This topic is covered in the Database Architecture section earlier in this chapter.
Effective Data Retrieval and Indexes
The size of a VLDB often introduces difficulties with quick retrieval of data needed for reports and analysis. Data consumption also becomes more complicated because indexes tend to grow to the point that their maintenance becomes impractical. You can address these issues with SQL Server by incorporating a partitioning strategy for both data and indexes.
Table partitioning, which we referenced in the Database Architecture section above, is an important tool for getting the most value out of your data warehouse data. See Chapter 4 for details about table partitioning, and see Chapter 5 for how to set up partitioned tables for efficient query access.