This section focuses on how different data warehouse data stores and data categories are organized into logical and physical databases.
Choosing the most appropriate database architecture for a data warehouse system is an important aspect of the overall data warehouse strategy: Poorly implemented database architecture will negatively affect data warehouse performance and overall user experience.
It’s not uncommon to find SQL Server data warehouse implementations in which no effort was made to create more than the single default filegroup, which results in all data, indexes, and logs residing on the same set of disks. Despite having a strong data model, such a data warehouse initiative will be labeled a failure because of:
Backup files becoming too large, resulting in excessive backup and restore times
Data files becoming too large to manage, causing excessive execution times for rebuilding indexes and updating statistics
This underscores the need for an effective physical database design consisting of files and filegroups. In addition, a good database schema design can make database objects easier to access and simplify the security implementation.
Databases, Schemas, and Filegroups
Data modelers are responsible for developing the data warehouse data models, which are accessed by the three roles mentioned below. A data warehouse can be viewed differently depending on your role and your objectives:
Database developers view the data warehouse through a SQL lens. Their objective is to select the correct information from the data warehouse as efficiently as possible.
Data integration developers view the data warehouse as a destination. Their objective is to populate the data warehouse as efficiently as possible.
Database administrators view the data warehouse as a precious resource. Their objective is to ensure that the data warehouse is operational, secure, and high-performing—both when serving up data to business consumers and during database maintenance operations.
Figure 2-35 shows these different roles and how they relate to the data warehouse.
Figure 2-35: Data warehouse database logical and physical view
The data warehouse database can be viewed both as a logical entity accessed through SQL and as a physical entity consisting of multiple files stored within the disk subsystem.
Every SQL Server database is stored on disk and has two or more files:
Primary data file – A database has only one primary data file, which contains startup information and pointers to other files in the database.
Log file(s) – A database can have one or more log files.
Secondary data file(s) – These optional files are used to store database objects. Although optional, the use of secondary data files is highly recommended for a SQL Server data warehouse.
Backup file(s) – These files are used to back up information from the data warehouse.
All database files are organized into one or more filegroups. Every file has one filegroup parent, and every database object is associated with a filegroup. The primary filegroup stores all system tables.
A key database architecture task is determining which database objects reside in which filegroups. This task must take the different developer and DBA activities into consideration, including select, insert, and update activity; backup and restore processes; and database maintenance operations such as re-indexing, the updating of statistics, defragmentation, and so on.
Databases are the parent for both logical schemas and physical filegroups and files. Before we go deeper into physical filegroup configurations, it’s useful to review how database objects are logically accessed and organized within SQL Server.
The SQL Server Transact SQL (T-SQL) language supports three-level naming when accessing database objects within the same server:
The following query, which specifies all three levels, returns the same results as a simple SELECT * FROM DimCustomer executed in the context of the AdventureworksDW2008 database:
SELECT * FROM AdventureworksDW2008.dbo.DimCustomer
Every database table has a schema, which we will look at in a moment. When a schema isn’t included in the object reference, the default schema is assumed: either dbo or a default schema explicitly assigned to the user account by the DBA.
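Where the default-schema behavior matters, the DBA can set it explicitly. A minimal sketch, assuming a hypothetical ReportUser account and Sales schema:

```sql
-- Hypothetical user and schema names. After this statement, unqualified
-- table names in ReportUser's queries resolve to the Sales schema first.
ALTER USER ReportUser WITH DEFAULT_SCHEMA = Sales;
```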
Note that SQL Server linked servers extend this query model to four-level naming (ServerName.DatabaseName.SchemaName.ObjectName) to access an object on another server.
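For example, a four-part reference might look like the following, where LinkedSrv01 is a hypothetical linked server name:

```sql
-- LinkedSrv01 is a hypothetical linked server pointing at the remote instance.
SELECT * FROM LinkedSrv01.AdventureworksDW2008.dbo.DimCustomer;
```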
However, linked servers are not regularly used in production for performance and security reasons. Every linked server request has to first connect to a remote server and then retrieve the results before returning them to the requester. More often, SSIS data flows are used to populate tables local to the server, which are then directly accessible within a SQL Server instance.
Schemas provide for logical grouping of tables. Figure 2-36 shows an example in which tables are grouped by the logical domain they belong to.
The decision to create schemas is made by the data warehouse team in the design phase. The following are reasons for using schemas:
Logical grouping – The ability to logically organize tables into groupings or subject areas makes it easier to navigate complex databases such as data warehouses. The example above illustrates how tables can be organized into subject areas.
Logical partitioning – There are many other ways to logically group tables; one example is grouping by entity. For example, the Delta-American Airlines merger creates the need to consolidate data from the LOB systems used by each entity prior to the merger. Creating Delta and AmericanAirlines schemas with identical table structures, such as DimCustomer, provides an intermediate structure prior to loading the consolidated DimCustomer table.
Security – The ability to grant and restrict access to database users and roles through different schemas simplifies the security approach for a database. For the example in Figure 2-36, you could:
Grant SELECT access to the HR department for the HumanResources schema
Grant INSERT, UPDATE, and DELETE access to the HR Admin role for the HumanResources and Person schema
Grant SELECT access to all data warehouse users for the Person schema
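The schema-level grants above can be sketched in T-SQL as follows; the role names (HRDepartment, HRAdmin, DWUsers) are hypothetical stand-ins for the department and roles described:

```sql
-- Hypothetical role names standing in for the roles described above.
GRANT SELECT ON SCHEMA::HumanResources TO HRDepartment;

GRANT INSERT, UPDATE, DELETE ON SCHEMA::HumanResources TO HRAdmin;
GRANT INSERT, UPDATE, DELETE ON SCHEMA::Person TO HRAdmin;

GRANT SELECT ON SCHEMA::Person TO DWUsers;
```

Because permissions are granted at the schema level, tables later added to these schemas automatically inherit the same access rules.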
The downside to creating schemas is that the schema name must be included in all database object references. This may not seem like a significant issue for developers and DBAs, but it could be an issue for downstream business consumers. Also, any change in the schema name will require a change to every SQL statement accessing objects within the schema, except in the case where the user is referencing objects within their default schema. Changes in the schema name will not impact the underlying physical structure because schemas are a logical construct.
When using schemas in SQL Server, it is important to recognize that schemas are not equivalent to owners. Thus, it is not necessary to change the owner of objects when owner accounts are removed. A schema does have an owner, but the owner is not tied to the schema name.
Note that dbo is the only schema supported in the initial PDW release.
A core database architecture concept is the filegroup, which contains database files and represents the physical aspect of the architecture.
Database tables, indexes, and logs are created on filegroups, and filegroups contain one or more physical files. Proper management of files and filegroups is instrumental to efficient data warehouse design and maintenance. Here are some scenarios in which a well-planned filegroup architecture can provide significant value:
Partitioned fact tables – Large fact tables (more than 100 million rows) stored in a single database file benefit from partitioning. Modifying such tables so that their data is divided among multiple physical files, with each file stored on a separate physical disk array, enables the Database Engine to perform multiple parallel reads (one per file), improving read performance. Figure 2-37 shows a simple scenario of a fact table with two partitions.
Figure 2-37: Partitioned fact table
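A two-partition layout like the one in Figure 2-37 can be sketched with a partition function and scheme; all object, column, and filegroup names here are hypothetical, and the filegroups are assumed to already exist on separate disk arrays:

```sql
-- Sketch only; names are hypothetical. One boundary value yields two
-- partitions, each mapped to its own filegroup (and disk array).
CREATE PARTITION FUNCTION pfOrderDate (int)
    AS RANGE RIGHT FOR VALUES (20100101);

CREATE PARTITION SCHEME psOrderDate
    AS PARTITION pfOrderDate TO (FactFG1, FactFG2);

CREATE TABLE dbo.FactSales
(
    OrderDateKey int   NOT NULL,
    CustomerKey  int   NOT NULL,
    SalesAmount  money NOT NULL
) ON psOrderDate (OrderDateKey);
```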
Backup/restore and archiving – For VLDBs, it may become impossible to manage backups and restores for certain files in a timely fashion unless multiple filegroups are introduced and backed up separately.
Other database maintenance activities – Partitions also have advantages for indexing, consistency checking, updating statistics, and other maintenance tasks.
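As an illustration of filegroup-level backup, the following sketch backs up a single filegroup of a hypothetical SalesDW database independently of the rest of the database:

```sql
-- Hypothetical database, filegroup, and path names.
BACKUP DATABASE SalesDW
    FILEGROUP = N'FactFG1'
    TO DISK = N'E:\Backup\SalesDW_FactFG1.bak';
```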
For more information about SQL Server filegroups, see the MSDN article Files and Filegroups Architecture.
Regarding database files, it is important to consider the physical architecture of your hardware infrastructure to take advantage of separate disk controllers. In addition:
Placing non-clustered indexes on files separate from files containing data is a common practice to improve data warehouse performance.
A strategy for where to store tempdb is an essential part of the overall database architecture because, if tempdb is not properly placed, write operations to this frequently used database often conflict with read/write operations on data and log files.
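Separating non-clustered indexes onto their own filegroup, as described above, can be sketched as follows; the database, filegroup, file path, table, and index names are all hypothetical:

```sql
-- Sketch with hypothetical names and paths: a dedicated filegroup for
-- non-clustered indexes, placed on a drive separate from the data files.
ALTER DATABASE SalesDW ADD FILEGROUP IndexFG;

ALTER DATABASE SalesDW ADD FILE
(
    NAME = SalesDW_Index1,
    FILENAME = N'F:\Index\SalesDW_Index1.ndf',
    SIZE = 10GB
) TO FILEGROUP IndexFG;

-- Placing the index on IndexFG keeps index I/O off the data files.
CREATE NONCLUSTERED INDEX IX_FactSales_CustomerKey
    ON dbo.FactSales (CustomerKey)
    ON IndexFG;
```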
The next section of this chapter will address the questions: How many databases should be in my data warehouse, and what factors affect this decision?
To have a high-performing data warehouse, you need to understand and properly manage storage for database objects. The MSDN article Storage Top 10 Best Practices is a good place to start.
Most data warehouses use SAN storage, with disks organized in applicable types of RAID configurations. Depending on disaster recovery policies, some data warehouse systems will reside on database clusters, but which services should be set up for failover depends on an organization’s particular business requirements.
Considerations for Physical Database Design
This section outlines several considerations for physical database design as it relates to logical design and architecture decisions. For more about best practices for partitioning with SQL Server, see the partitioning section of Chapter 4.
Capacity Planning and Designing for Change
Initial capacity planning is required for every data warehouse and includes estimates for initial data size and data growth. In addition, physical designs need to be implemented so that new disks and servers can be added without affecting systems already in production.
Templates in Physical Design
The process of deploying the physical database architecture can be simplified by establishing templates, which are deployed to a new database with a predefined architecture for filegroups and files. However, these templates need to be adjusted for special cases, such as when the complexity and size of data in the fact tables require multiple files for more efficient partitioning.
Recommended practices for using templates in database architecture include maintaining scripts for template databases, including all relevant CREATE statements. These scripts are modified to follow the naming standards used in a particular data warehouse implementation and are then executed to create the new database objects.
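A template script of this kind might look like the following sketch; the database name, filegroup layout, file sizes, and paths are all hypothetical placeholders to be adjusted to the implementation's naming standards:

```sql
-- Hypothetical template: one primary file, one data filegroup, one log file.
CREATE DATABASE DWTemplate
ON PRIMARY
    (NAME = DWTemplate_Primary, FILENAME = N'D:\Data\DWTemplate.mdf', SIZE = 100MB),
FILEGROUP DataFG
    (NAME = DWTemplate_Data1, FILENAME = N'E:\Data\DWTemplate_Data1.ndf', SIZE = 10GB)
LOG ON
    (NAME = DWTemplate_Log, FILENAME = N'L:\Log\DWTemplate_Log.ldf', SIZE = 1GB);
```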
Partitioning
Partitioning is a valuable technique for very large tables within the data warehouse. Partitioning creates smaller clusters of data, which enables maintenance operations to be applied on a partition-by-partition basis. Partitioning also enables low-latency loading because source data is continuously loaded into a passive partition, which is switched with the active partitions on a set schedule.
In physical design, typically for purposes of producing proof-of-concept models, the partitioning strategy is implemented after business rules, entity relationships, and attribute and measure definitions are established.
Partitioning is a key tool employed by both DBAs and database developers when working with very large tables and is covered in more detail in Chapters 4 and 5.
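The partition-switch loading pattern mentioned above can be sketched as follows, assuming a hypothetical FactSales partitioned table and a staging table with an identical structure on the same filegroup:

```sql
-- Sketch with hypothetical table names. The switch is a metadata-only
-- operation, so the newly loaded data becomes visible almost instantly.
ALTER TABLE dbo.FactSales_Load
    SWITCH TO dbo.FactSales PARTITION 2;
```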
Striping
Striping is associated with the RAID levels implemented as part of the data warehouse architecture. RAID levels 0, 1, 5, and 10 are typically used with SQL Server. Striping is an important aspect of database architecture because it improves read performance by spreading I/O operations across disks. The following SQL Server Customer Advisory Team (SQLCAT) article contains more information about striping and RAID levels for SQL Server:
Storage Top 10 Best Practices
Data Compression
Data compression can help reduce the size of the database as well as improve the performance of I/O-intensive workloads. However, CPU cycles are required on the database server to compress and decompress the data while data is exchanged with the application.
SQL Server provides two levels of data compression: row compression and page compression.
Row compression helps store data more efficiently in a row by storing fixed-length data types in variable-length storage format. A compressed row uses 4 bits per compressed column to store the length of the data in the column. NULL and 0 values across all data types take no additional space other than these 4 bits.
Page compression is a superset of row compression. In addition to storing data efficiently inside a row, page compression optimizes storage of multiple rows in a page by minimizing data redundancy.
It is important to estimate space savings and apply compression only to those tables and indexes that will yield reduced I/O and memory consumption due to the reduced size. The sp_estimate_data_compression_savings stored procedure can be used for SQL Server 2008 R2 databases.
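A sketch of estimating and then applying page compression, using a hypothetical dbo.FactSales table:

```sql
-- Estimate page-compression savings for a hypothetical table, then apply
-- compression if the projected savings justify the CPU cost.
EXEC sp_estimate_data_compression_savings
    @schema_name      = N'dbo',
    @object_name      = N'FactSales',
    @index_id         = NULL,
    @partition_number = NULL,
    @data_compression = N'PAGE';

ALTER TABLE dbo.FactSales REBUILD
    WITH (DATA_COMPRESSION = PAGE);
```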
Data compression for SQL Server is explained in detail in the SQLCAT article Data Compression: Strategy, Capacity Planning and Best Practices.