Data warehouses are read-optimized versions of line-of-business (LOB) systems and other source feeds. The sheer volume of data in a data warehouse can result in queries taking hours to complete while consuming large amounts of server resources. Worst case, queries may never complete within the required time, while negatively impacting other data warehouse workloads. Successful data warehouses require development team members to have a deeper understanding of query dynamics, as well as of best practices for monitoring, and performance tuning.
This chapter’s audience is the data warehouse team members responsible for developing and optimizing queries and monitoring data warehouse query workloads on the Microsoft SQL Server platform.
This chapter is organized into following sections:
Figure 5-1 shows how querying, monitoring, and tuning are related.
Figure 5-1: Query optimization cycle
Database professionals responsible for building large data warehouse applications need to be equipped with knowledge of SQL Server techniques such as table partitioning, indexing strategies, query isolation levels, execution plans, SQL Profiler, ETL techniques, and a variety of performance-related methods for optimization.
However, this is only a first step. The data warehouse also needs to be systematically monitored for excess resource utilization and anomalous behavior. Metrics collected from monitoring processes should enable query and server base lining, expose performance trending, and help with long-term capacity planning.
A key deliverable of a monitoring solution should be the ability to pinpoint reoccurring queries that carry an exceptionally high database cost. This information should then form a “to-do” list for subsequent performance tuning efforts. This approach should provide the highest return on investment for the time devoted to optimizing data warehouse queries.
Developing Queries with Attention to Both ‘What’ and ‘How’
There are times when it is enough to simply know the answer—the “what.” For example, when your spouse asks, “What time do you plan to be home from work?” you can generally give an ETA without delving into the method of calculation based of tasks, promises, deadlines, and commuting hazards that stand in your way.
Contrast this to the “what and how” of a common database query. Suppose your business consumers need to determine the number of orders placed for a widget during the previous day. Of paramount importance is the “what” (i.e., the correct number of orders), which you can produce by properly incorporating business rules and database logic. However, you may also need to examine how the results are derived if the query runs too long.
Now consider a large data warehouse query. In this example, let’s say you need to determine the changes to sales volume based on a correlation to Internet banner ad impressions. This scenario typically requires the analysis of large tables and filtering based on a variety of criteria. You, of course, still need accuracy, but the query plan may become critical, especially if the execution time expands from minutes to hours, and beyond. Bottom line: You will want to become very familiar with estimated and actual query plans when working in a data warehouse—or in other words, the “how” of your SQL queries.
Note: The term query in this chapter refers to all SELECT, INSERT, UPDATE, and DELETE statements that are often batched together in scripts, stored procedures, and ad hoc queries.