Master data and Master Data Management (MDM) are hot topics in the industry today. Best-of-breed MDM software products have hit the market, and major software vendors have added MDM capabilities to their product suites. SQL Server 2008 R2, for example, introduced Master Data Services (MDS).
However, it’s still early in the adoption curve. There is still confusion in this area, and few organizations are managing all of their master data within an MDM solution.
This section provides more clarity around the topics of master data and MDM and gives guidance around the options that organizations have for managing their master data.
Figure 2-24 illustrates some questions organizations are asking today, such as:
What is our organization’s MDM solution?
Is there an overlap between data warehouses and MDM?
Do I need an MDM software product? And if so, how does my MDM software communicate with my data warehouse?
Figure 2-24: Master Data Management questions
Before continuing, it’s useful to look back in time to provide a historical perspective on how the MDM industry arrived at where it is today. Figure 2-25 shows a timeline with significant events leading up to the present.
Figure 2-25: Master Data Management timeline
According to the timeline:
Data warehousing has been part of the database industry’s lexicon since the early 1990s. A major selling point for an EDW was that it provided a “single version of the truth.”
The 1990s was also a decade that saw wide adoption of LOB systems, starting with Enterprise Resource Planning (ERP). One business driver for these systems was Y2K concerns.
The advent of the Internet and multiple customer channels drove demand for CRM systems. And wide adoption of CRM created challenges for data warehouses because there were now “multiple versions of the truth” implemented in complex enterprise-level systems such as SAP and Siebel (now Oracle).
Customer Data Integration (CDI) started appearing in analyst writings around 2000. Although this software category never gained critical mass, it did raise awareness of the need to provide “a single version of the truth” for key business entities.
The concept of a data governance discipline started to emerge after organizations recognized that data quality was an ongoing enterprise discipline, not a single-point-in-time solution.
The term Master Data Management began to appear in 2002-2004 and was followed by the emergence of best-of-breed MDM products.
In 2010, SQL Server 2008 R2 shipped with Master Data Services 1.0.
An Accenture article titled Master Data Management (MDM) – Hot Topic Getting Hotter outlines MDM common business definitions and management processes for business and reference data. It also stresses the importance of data governance, stating:
The technology side consists of master data management systems that provide “a single version of the truth.”
Here’s where the confusion lies. Today’s consensus is that a data governance discipline is necessary, but how about “a single version of the truth”? Where does the authoritative version of the data reside? Is it within a master data database, the data warehouse, or both? What tools are used to create it: the data warehouse and data integration processes, an MDM system, or a combination of the two?
The first step is to define master data and Master Data Management.
What Is Master Data?
Many papers and articles include definitions of master data. The following definition is from the MSDN article The What, Why, and How of Master Data Management, by Roger Wolter and Kirk Haselden:
Master data is the critical nouns of a business. These nouns generally fall into four groupings: people, things, places, and concepts.
(Note that this definition is abbreviated for readability.)
This article also categorizes an organization’s data into data types, as Figure 2-26 shows, and describes which ones are stored within a data warehouse.
Note: Instead of the term data type, which could be confused with the same term commonly used in physical database design, in this chapter, we use data category.
Figure 2-26: Data categories
The different data categories are:
Unstructured – Data found in email, magazine articles, corporate intranet portals, and so on
Metadata – Data about data, including report definitions, database column descriptions, and configuration files
Master – The critical nouns of a business, which generally fall into four groupings: people, things, places, and concepts
Transactional – Data related to sales, deliveries, invoices, trouble tickets, claims, and other monetary and non-monetary interactions
Hierarchical – One- or multi-level relationships between other data, such as organizational charts and product lines
Reference – A classification of a noun or transaction; often has two columns—code and description—and examples include marital status and gender
Notice in Figure 2-26 that:
Master data exists within a data warehouse and can be fed from an MDM system.
Reference and hierarchy data are used to classify master data and transaction data.
Unstructured data is outside the scope of the data warehouse.
Metadata is valuable but out of scope for this discussion.
What Is Master Data Management?
The next step is to define Master Data Management. David Loshin in the book Master Data Management defines MDM as:
A collection of best data management practices that orchestrate key stakeholders, participants, and business clients in incorporating the business applications, information management methods, and data management tools to implement the policies, procedures, services, and infrastructure to implement the capture, integration, and subsequent shared used of accurate, timely, consistent and complete master data.
There are many other definitions, but the key points are:
MDM is not only a technical solution; it’s an ongoing set of processes and an organizational discipline.
Transactional data is out of scope for MDM.
Classification data—that is, reference and hierarchy data—is in scope for MDM.
Where Do Data Warehousing and MDM Overlap?
Now that we’ve defined master data and MDM, the next question is: Where’s the overlap between MDM and data warehousing?
The simple answer to this question is master, reference, and hierarchy data.
Master data has existed within the data warehouse long before the term Master Data Management was coined. This section shows where master data exists within a data warehouse; Figure 2-27 shows the data stores in scope for this discussion.
Figure 2-27: Data stores containing master data
Let’s start with the data warehouse destination and work our way back to the original source.
Most data marts, either within the consumption area or downstream from the data warehouse, use a denormalized data model for optimal query access. Figure 2-28 shows an example within the SQL Server sample data warehouse, AdventureWorksDW2008. (Note that this chapter assumes that the reader is familiar with dimensional data modeling concepts.)
Figure 2-28: Reseller Sales snowflake schema
The Reseller Sales fact table contains all sales transactions for the Adventure Works bicycle shop. This fact table contains sales order quantity, product cost, and other numeric values recorded at the sales transaction. The fact table also consists of foreign keys to dimensions, which are used to filter the results from the joined fact and dimension tables. Note that transactional data is out of the scope of MDM and fact tables are not considered master data.
Each dimension in the data warehouse data store is sourced from master data, reference data, and hierarchy data residing in the staging data store. ETL processes process source data en route to the staging area. This processing includes cleansing, consolidating, and correlating master, reference, and hierarchy data as well as linking it to transaction data.
Figure 2-29 shows a partial mapping of three dimensions to their master, reference, and hierarchy data sources within the staging data store. Note that these mappings are conceptual and not actual mappings since the Adventure Works database samples do not include a staging data store.
Figure 2-29: Mapping three dimensions to their data categories
Note that the objective of this example is not completeness but rather to show that:
Each dimension is in a denormalized form.
Its source is the staging data store.
The production area contains master, reference, and hierarchy data categories, which are the same categories managed by a data management system.
The production area also contains transactional data.
Thus, the answer to the question, “Is there an overlap between data warehousing and MDM?” is:
Yes, the master, reference, and hierarchy data categories exist in both.
Given this overlap, the next question arises:
Can an organization forego a data warehouse when it has an MDM software product?
The answer is no, because MDM software does not manage transactional data. However, transactional data holds all the numeric and monetary values used to measure an organization’s performance—a key objective for a data warehouse.
The next question then becomes:
Do I need an MDM software product? Does an MDM solution require an MDM software product?
The answer here is maybe. Below are some questions to determine these answers for your organization:
How many data sources contain master data?
Does the master data have strong natural keys?
How mature is the current data warehouse solution? Does it already handle master data?
What level of manual effort is required for reference list maintenance?
What if one reference table exists and is separately maintained within multiple source systems?
What level of hierarchy management exists? What level is required?
What if the internal hierarchy is a super-set of a standard? For example, what if a health care organization’s diagnostic codes are different than the ICD-9 standards?
Let’s look at how your answers to some of these questions might influence your decision to acquire and utilize an MDM software product.
How Many Data Sources Contain Master Data?
It’s useful to understand where master data is sourced from. Figure 2-30 shows an example of master data within a production area and its sources.
Figure 2-30: Master data by category and sources
Some notes about this source data:
Master data resides within the LOB systems.
Master data can be loaded from multiple LOB systems; this requires the data to be normalized and consolidated.
Reference and hierarchy data can reside in LOB systems, internal feeds, and external feeds.
Traditionally, internal feeds were created and maintained with Excel.
External data traditionally was structured in CSV format, but many industry standards are moving toward an XML format.
The need for a separate MDM software product increases as the number of sources increase.
How Mature Is the Current Data Warehouse Solution? Does It Already Handle Master Data?
The next step in determining whether an MDM software product is required is to review the data integration processes already in place. Figure 2-31 shows a high-level view of the data integration process used to load master data from one source to the staging area. Note that this chapter presents these data integration processes at a conceptual level; refer to Chapter 3 for details about data integration.
Figure 2-31: Master data integration flow
The data integration steps for master data are as follows:
Master data is extracted from the source.
Data quality business rules are applied, and data flows to an exception area if it fails a data quality check.
Master data is then normalized (i.e., attributes are used as lookups into reference lists and hierarchies). Data may flow to an exception area if the attribute does not match any entries in a reference list or hierarchy.
The process checks whether the master data already exists.
Master data flows to the production area if it doesn’t exist or if it has changed.
The above conceptual process for loading master data is very similar to processes used to populate dimensions and is familiar to data integration developers.
The simple answer to whether an MDM software product is required for loading master, reference, and hierarchy data is no if existing data integration processes already exist and are working without issues.
If these master data integration processes are not in place, the answer is then maybe, depending on the complexity of the master data integration processes, combined with other factors such as:
Whether the existing data warehouse team (developers and operations) are comfortable learning a new technology
Whether the new technology is affordable (both in acquisition and ongoing costs).
The takeaway from this section is that introducing an MDM software product has associated costs. The decision on whether an MDM software product is required should take these costs into consideration along with the complexity of the master data processes.
The next section introduces two master data scenarios and the data integration processing required for each.
Employee Data – One Source
In this first scenario, the data exists in one and only one source system, as Figure 2-32 shows. In this scenario, the Employee table is populated with Education and Marital Status reference data sourced from the same system.
Figure 2-32: Master data example – one source
A separate MDM product is not required in the simplest of scenarios.
Employee Data – Multiple Sources
However, master data often exists in more than one place—typically in multiple source systems, but it can also be an external or internal feed. The next example assumes that Education data comes from an external feed, as shown in Figure 2-33.
Figure 2-33: Master data example – multiple sources
Note that even in this simple multi-source scenario the following issues arise:
Reference lookup failures – The reference table in staging is sourced from the LOB system. The second source containing employee education information uses different descriptions. These cases require intermediate mapping tables, which take codes or descriptions from the source and map them into the staging reference table.
Production area table lookup failures – The lookup against the production area Employee table fails in the above scenario. It demonstrates two common reasons for these lookups failures:
Lack of a strong natural key – Employee Id is a strong natural key and is ideal for comparing employee data across internal LOB systems if, and only if, this key exists in each system. If it doesn’t, then the source-to-stage lookup becomes more complicated and error-prone.
Inexact matching – Name and/or address matching are perfect examples of where the matching process is not completely deterministic. In these cases, either a fuzzy matching algorithm or product or product capabilities specifically developed for name and address matching is required.
SQL Server 2008 R2 Master Data Services (MDS) may be beneficial in this scenario if the implementation team thinks that leveraging MDS capabilities are preferable to adding capabilities to existing features (such as Fuzzy Lookups). Here are a couple of things to think about when considering MDS:
Reference table management – What processes/interfaces are used for creating and maintaining reference tables and mappings that aren’t sourced directly from a LOB system?
Name/address matching – Can the team leverage SQL Server Integration Services (SSIS) Fuzzy Lookups and provide tools for data stewards to make a human decision when the fuzzy matching doesn’t produce strong results?
Complex Master Data: Multiple Sources and Hierarchy Management
The multi-source Employee data example above provides a simple illustration of master data within organizations. Many organizations have more complicated master data needs as well as other advanced MDM needs such as creating and maintaining multiple versions of very complex hierarchies (e.g., organizational charts, chart of accounts, disease classification, etc.).
In these cases, organizations can benefit from advanced data integration capabilities such as merge/match. In addition, version control allows data stewards to work on new reference data and hierarchies prior to publishing them in the data warehouse. Workflow approval supports a multi-tier review process, and role based security lets data stewards work only on subsets of master data. Each of these capabilities supports complex master data environments and are features within SQL Server 2008 R2 MDS.
In summary, master data and MDM are hot topics today but are old problems within data warehousing. Many existing data warehouse implementations already process master data effectively. However, organizations with complex master data needs that don’t have enterprise MDM capabilities in place can benefit from adopting an enterprise MDM solution such as SQL Server 2008 R2 MDS.
Now that we’ve covered the key concepts involved in data warehouse architecture, we are ready to address the responsibilities that are owned or co-owned by the data architecture team.