Chapter 2: Data Architecture

Download 232.2 Kb.
Date conversion29.04.2016
Size232.2 Kb.
1   2   3   4   5   6   7   8

Data Warehouse Concepts

This chapter’s focus is on data warehousing, not business intelligence (BI). Figure 2-8 illustrates the different data responsibilities and what is considered a data warehouse component or process.

Figure 2-8: Data warehousing vs. business intelligence

Data responsibilities are segmented as follows:

  • OLTP source systems and external data feeds are responsible for data creation

  • Data warehousing focuses on data production

  • BI focuses on data consumption

Data Warehouse Components

The data warehouse is comprised of the following components:

  • The consumption area serves up information to downstream consumers through SQL queries.

  • The production area is where source data is transformed, normalized, and consolidated. Note that a Data in area is common within data warehouse implementations and is typically housed within the production area.

  • The metadata area is populated with “data about data”—that is, business, technical, and process metadata providing detail behind the actual data, data models, and data integration processes.

  • The platform is the system or systems where the data warehouse resides.

Data integration processes are responsible for the movement of data from sources to destination databases. See Chapter 3 – Data Integration for details about this topic.

The data warehouse is responsible for optimizing the query performance within the data consumption area. BI components are beyond the scope of this chapter but include:

  • Data presentation, including decision-support systems, reports, and analytics

  • Data delivery channels

  • Downstream data stores, including data marts and semantic data models (e.g., OLAP)

Business Intelligence = Data Warehousing?

There are many examples within the SQL Server community where the terms business intelligence and data warehousing are used interchangeably. In reality, these are two separate disciplines. This is especially true for enterprise data warehouses due to an EDW’s scope, complexity, and large volumes of data.

As seen in the data warehouse maturity model, as a data warehouse matures, the data warehouse team spends more and more resources on data integration. This shifts the focus from data consumption to data production.

Simply put, data warehousing and BI differ because:

  • The primary focus for a data warehouse is the production of data.

  • The primary focus for BI is the consumption, presentation, and delivery of the data produced by the data warehouse.

One example of where BI is confused with data warehousing within the SQL Server community is the AdventureWorks samples available at the Microsoft SQL Server Community Projects & Samples site. These samples include the following databases:

  • A sample OLTP database (AdventureWorks2008R2)

  • A sample data warehouse database(AdventureWorksDW2008R2)

  • A sample SQL Server Analysis Services (SSAS) database, or cube

The AdventureWorks sample data warehouse link provides more information about the AdventureWorks data warehouse and supporting scenarios in which data warehouse, data mining, and Online Analytical Processing (OLAP) are relevant. However, note that these are BI scenarios because the focus is on consumption, not production. There are no samples for the data warehouse production scenario; instead, the data warehouse is populated directly from Comma Separated Values (CSV) files.

In summary, BI focuses more on data consumption and should not be equated to data warehousing, which concentrates more and more on data production as it matures, especially when you are working with very large data volumes.

Data Warehouse Architectures

As we noted earlier, much of the data warehouse literature is grounded in one of two different implementation approaches:

  • The top-down approach is often used when describing Bill Inmon’s Corporate Information Factory reference architecture.

  • The bottom-up approach is often used when describing Ralph Kimball’s dimensional modeling and Enterprise Data Bus strategy.

The top-down approach historically has led to a centralized EDW, and the bottom-up approach has led to federated data marts. This section reviews these two approaches and presents a third approach used for many data warehouses.

The centralized EDW, shown in Figure 2-9, was the first data warehouse database architecture. This architecture creates a central repository for all of an organization’s integrated data.

Figure 2-9: Centralized EDW

As stated earlier, the sheer scope of implementing an EDW often results in “analysis paralysis”—that is, inordinate amounts of time spent gathering requirements and designing subject area data models. These extended development cycles increase the risk of user requirement changes, user requirement misinterpretation, and ultimately, a failed implementation.

In 1996, Ralph Kimball published The Data Warehouse Toolkit. This book introduced dimensional data modeling to a large population and contained examples of dimensional data models for a variety of vertical scenarios. Kimball was also active on the lecture circuit and started to support federated data marts over a centralized EDW. Figure 2-10 illustrates the federated data mart approach.

Figure 2-10: Federated data marts

In this architecture, a separate data mart is implemented for each business process and subject area. This strategy allows for shorter development cycles, lowers the risk of changing and misunderstanding user requirements, and delivers partial solutions faster.

However, a drawback of federated data marts is that it often results in “multiple versions of the truth,” even when the ideal approach for federated data marts is to have a common data model.

The data warehouse space started seeing the implementation of both centralized data warehouses, which often presented aggregated data to business users in support of faster SQL queries, and subject-oriented data marts created and populated from the central data warehouse. This resulted in a third approach, the hub-and-spoke architecture, shown in Figure 2-11. This model incorporates the benefits of a centralized data warehouse database and federated data marts:

  • The central data warehouse database provides business consumers with “a single version of the truth.”

  • Separate data marts provide business consumers with better performance.

Figure 2-11: Hub-and-spoke architecture

The hub-and-spoke approach has downstream data marts that are fed from a common data warehouse database. Data marts return consistent results because they are all populated from the data warehouse database, which contains a single version of the truth. Performance is improved for business unit analysis because the marts contain less information than the data warehouse database and also have a smaller user community.

Note that scope issues still exist with the hub-and-spoke configuration—the data warehouse database still contains multiple subject areas and many consolidated sources.

Microsoft’s Parallel Data Warehouse (PDW) is a distributed architecture that supports the hub-and -spoke approach. PDW provides a publish model that supports the parallel loading of data mart spokes from the data warehouse hub. This publish model reduces data integration processes and the resources required to maintain these processes. You can read more about PDW in the MSDN article Hub-And-Spoke: Building an EDW with SQL Server and Strategies of Implementation.

One of the primary deliverables for the data warehouse team is the data model or models. The next section provides an overview of this topic.

Data Models: Normalized to Denormalized

One characteristic of many data warehouses is that the data is transformed from a normalized to denormalized form as it makes its way from sources to the production and consumption areas, as Figure 2-12 shows.

Figure 2-12: Normalized to denormalized data model

Most source data originates from transactional line of business (LOB) systems. These LOB systems’ data models are highly normalized in support of Online Transaction Processing (OLTP). Other sources, mostly file-based, are typically in a denormalized format and have various sources (e.g., external vendors and partners, exports from internal systems). In some cases, these files are created and maintained by business and technical users within applications such as Microsoft Excel and Access.

The production area is where source data is consolidated and rationalized. The subject areas within the production area are typically modeled in a normalized format, but less so than the source LOB systems.

This data is then transformed, denormalized, and aggregated when it flows to the consumption area. Dimensional data models are an example of a denormalized structure optimized for data consumption; Figure 2-13 shows one example, the AdventureWorksDW2008R2 FactResellerSales snowflake data model.

Figure 2-13: Snowflake dimension example: FactResellerSales

Note that this sample database is modeled to demonstrate the capabilities of SSAS. A normalized data model would not have the Parent-Child and the Snowflake dimensions highlighted above.

Normalized vs. Denormalized

The key differences between normalized and denormalized data structures are as follows:

  • OLTP systems are where data is created. Normalized data structures support a high-volume transactional workload, which consists of inserts, updates, deletes, and selects of individual or a small number of records.

  • In a denormalized approach, data structures are optimized for data consumption (.i.e., workloads that process high volumes of records).

Dimensional modeling is one common example of a denormalized approach where information is grouped into dimensions and facts, as we see in Figure 2-13. Facts contain surrogate key pointers to dimensions along with mostly numeric measures. Dimensions contain flattened or snow-flaked hierarchies and relationships for entities and relevant attributes.

Deciding on the amount of denormalization within the consumption area involves balancing consumer requirements with source system models and load complexity, including the following considerations:

  • Whether to use a star schema or snowflake schema implementation. A star schema is a fully denormalized implementation. See the Kimball Group’s Web site for more information about star schemas, snowflake schemas, and dimensional modeling.

  • How much to denormalize data for the most efficient SQL access. This, however, can increase load complexity. When choosing the appropriate level of denormalization and complexity of data integration processes related to it, consider the data warehouse requirements for analysis and data latency as well as overall delivery milestones and timelines for the data warehouse initiative.

  • Whether to use database views to present a denormalized view of normalized data.

    • This implementation pattern allows for introducing denormalized data structures without having to materialize tables. With this approach, data integration for existing tables doesn’t need to change.

    • This approach is sometimes seen as a feed for a downstream semantic layer such as SSAS.

    • Views will have a negative impact on performance unless they are materialized.

Figure 2-14 shows different table structures for a hypothetical entity containing information about vendors for an organization. This illustrates how a data model changes as it flows from the source to the production area and then to the consumption area.

Figure 2-14: Example source, production, and consumption area table structures

Production area data store characteristics are as follows:

  • Tables are denormalized some, but the overall data model is still normalized.

  • Common naming conventions are applied to tables, attributes, and other database objects.

  • Business keys are identified and enforced, as are relationships between entities.

  • Business rules for data types for various types of string and numeric attributes are introduced and enforced.

  • All of these are building blocks that help provide for consistency and manageability across a data warehouse.

A consumption data store has these characteristics:

  • Tables are denormalized. These structures are designed for most effective data retrieval.

  • Natural keys in dimensions are referenced via surrogate keys.

Natural and surrogate keys are discussed in more detail in the data modeling section later in this chapter.

In summary, choosing an appropriate data warehouse architecture and data model approach are central to the success of a data warehouse. The decisions about which data warehouse architecture and data models to use, however, are independent of one another—a federated data mart can have a more normalized data model, while a centralized EDW can have both normalized and denormalized data models, for example.

Once you’ve selected the architecture for your data warehouse, the next deliverable is the database architecture, which we cover in the following section.

1   2   3   4   5   6   7   8

The database is protected by copyright © 2016
send message

    Main page