Chapter 5: Querying, Monitoring, and Performance Tuning




Querying


This section covers querying SQL Server with an emphasis on the challenges encountered with large queries in a data warehouse. We’ll start with a bird’s-eye view of a query plan, and then look at the common approaches to managing query response times for very large tables: namely, optimized schemas, table partitioning, and targeted indexes.

Scripts are included with this chapter both to generate sample data and to query that data, illustrating the techniques we cover. Due to the random nature of the generated test data and the various platforms that run SQL Server, your results may vary slightly, but the underlying query fundamentals should remain the same.


Query Execution Plans


Throughout this chapter, we will be referring to query plans and, specifically, the graphical query plan, which is accessible through SQL Server Management Studio (SSMS). When querying in SSMS, you can enable the display of the query plan by clicking the icon shown in Figure 5-2.



Figure 5-2: Enabling the graphical execution plan

Once you've enabled the query plan display, an additional Execution Plan tab appears in the result pane after a query completes. This tab contains the plan, which details how SQL Server actually resolved, or processed, the query. Figure 5-3 shows a simple query and its execution plan.





Figure 5-3: Sample graphical execution plan

In this example, the Execution Plan is contained in the last of the three tabs presented in the lower query result pane. A brief right-to-left reading of the plan indicates that 91% of the query cost is due to a Clustered Index Scan, followed by a Stream Aggregate operation with a 9% cost. You can obtain details of these operations, as well as of the connecting arrows (for example, row counts), by hovering over a graphical section of the execution plan. The width of the arrows is also a visual indication of the relative number of rows that are fed into or result from an operation.

We will refer to query execution plans throughout this chapter. For more information about graphical query execution plans, see Displaying Graphical Execution Plans (SQL Server Management Studio).

Sample Database Overview


This chapter includes two scripts that are used to create sample data and three scripts for querying this sample data as well as data collected by SQL Server’s Management Data Warehouse (described later in this chapter). To gain a better understanding of the concepts presented in this chapter, you are encouraged to download and run the scripts associated with this chapter.

The following T-SQL scripts can be downloaded from the Data Warehouse Toolkit area of Codeplex:



Script                            Description
#1 - Create DWToolkit             Script to create the DWToolkit database
#2 - Populate DWToolkit           Script to populate DWToolkit with randomly generated data
#3 - Querying Section             T-SQL code to run and test with in the Querying section
#4 - Monitoring Section           T-SQL code to run and test with in the Monitoring section
#5 - Performance Tuning Section   T-SQL code to run and test with in the Performance Tuning section

Script #1 simply creates the database called DWToolkit. You will want to ensure that this database has been successfully created before running Script #2, which populates the database.

The remaining scripts (#3, #4, and #5) correspond to the sample code used in the three major sections of this chapter: Querying, Monitoring, and Performance Tuning. They are not intended to be run as a part of the DWToolkit database setup.

Figure 5-4 shows the tables you should see in the DWToolkit database after running Script #2.



Figure 5-4: Table container for the DWToolkit database

The four Sales tables that are created have almost identical content in terms of the number of rows and columns. What differs is their physical organization in SQL Server. Table 1 lists the design purpose of each table.



Table             Design
Sales             Partitioned by a cyclical month number (rotating window)
SalesByDate       Partitioned by month (sliding window) with a Clustered Index
SalesByDateHeap   Partitioned by month (sliding window) as a Heap
SalesNP           No partitioning

Table 1: Physical organization of the Sales tables

The Product table contains the item numbers referenced by all of the Sales tables. You can rerun Script #2 at any time to regenerate the test data.


Benefits of De-normalization from a Data Warehouse Query Perspective


Despite the storage efficiency and design elegance of a normalized database schema, the simple fact is that there are query costs associated with joining tables together. The more tables and rows you need to join, the greater this expense. For OLTP systems, the cost of joins is often mitigated by using highly selective queries. But for data warehouses, with potentially billions of rows of data and aggregations covering large data sets, the query expense for table joins can be considerable.

Other query benefits of de-normalization include a simplified, more intuitive schema; encapsulation of database rules; a reduction in foreign keys; and broader index opportunities. Another, perhaps underrated, benefit of de-normalization is that it tends to produce more understandable query plans.

Ideally, the plan devised by SQL Server should illustrate any inefficiencies by displaying thick connecting lines and large sub-tree costs for expensive operations. However, if the plan itself contains hundreds of steps, it may be difficult and time-consuming to discern the actual bottlenecks and circumvent a performance issue.

Another consideration when dealing with highly complex query plans is that the optimizer does not perform an exhaustive search for the best query plan. Instead, it applies heuristics to reduce the number of plans to be evaluated in order to come up with a good plan in a reasonable amount of time. In other words, the optimizer performs a fast search for a good query plan, which may not be the best possible plan.

Although de-normalization has its benefits, this schema design technique is not always possible or practical. If de-normalization is not implemented within your data warehouse, you can still use strategies such as indexed (materialized, schema-bound) views and summary tables. These practices can also significantly reduce the query costs associated with huge joins and table aggregations (discussed later in this chapter).

Partitioned Tables


It’s not unusual to have tables in a data warehouse that contain billions of rows of data. With tables of this size, querying, inserting and updating, and maintenance tasks such as data archiving and re-indexing require more than casual consideration.

One approach to managing very large tables is to take advantage of table partitioning. This strategy essentially breaks a large table into smaller parts and allows independent partition access for faster queries as well as other maintenance tasks.

You will want to keep in mind that partitioned tables often require some finesse to ensure maximum query performance. The diagram in Figure 5-5 shows the logical organization of a partitioned table. A partition function, based on range partitioning, is used to direct table rows to a given physical partition number. Each partitioned table can have up to 1,000 partitions. Index pages can be either aligned (as shown in Figure 5-5) or non-aligned with the data pages of the partitioned table. Aligning an index with the partitioned table is preferred because SQL Server can then switch partitions quickly and efficiently while maintaining the partition structure of both the table and its indexes.

Note: A modulus column can also be used to implement hash partitioning, as demonstrated later in this section.



Figure 5-5: Partitioned table organization

Queries Based on the Partitioned Column


To get started, let’s look at a table called SalesByDate, which has 40 partitions based on month.

Note: Any date division (day, week, etc.) can be used for partitioning as long as the total number of ranges does not exceed 1,000 active partitions. The ranges can be based on calendar, fiscal, or other date designations.

The partition function in this example uses a partition data type of Date, as illustrated in the following code snippet:

CREATE PARTITION FUNCTION [SalesByDate_PartitionFunction](DATE)

AS RANGE RIGHT FOR VALUES

( '2008-02-01', '2008-03-01', '2008-04-01', '2008-05-01', ... )
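For context, a partition function by itself does not store data; it must be mapped to filegroups through a partition scheme, and the table is then created on that scheme. The following is a minimal sketch of how a table like SalesByDate might be built on this function. The scheme name, filegroup, table name, and column list are assumptions for illustration; the actual definitions are created by Scripts #1 and #2 and may differ:

-- Hypothetical sketch; the chapter scripts contain the actual definitions
CREATE PARTITION SCHEME [SalesByDate_PartitionScheme]
AS PARTITION [SalesByDate_PartitionFunction]
ALL TO ([PRIMARY]);

CREATE TABLE [dbo].[SalesByDate_Example]
(
    [Date]     DATE NOT NULL,
    [StoreID]  INT  NOT NULL,
    [Item]     INT  NOT NULL,
    [Quantity] INT  NOT NULL
) ON [SalesByDate_PartitionScheme] ([Date]); -- rows are routed to partitions by [Date]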

Note: This scheme can also be the basis of a monthly sliding window partitioning strategy where, each month, the oldest partition is switched out and then merged (i.e., collapsed) and a new partition is split (created) from the most recent partition. For more information on maintaining ‘sliding window’ partitioned tables see: http://blogs.msdn.com/b/menzos/archive/2008/06/30/table-partitioning-sliding-window-case.aspx
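As a rough illustration of that monthly maintenance cycle, the steps below switch out the oldest partition, merge its boundary, and split a new partition for the upcoming month. This is a sketch only: the partition scheme name, archive table, and boundary dates are assumptions and would need to be derived from your actual partition layout.

-- Hypothetical monthly sliding-window maintenance (object names and dates are illustrative)
-- 1. Switch the oldest partition into an empty, identically structured archive table.
ALTER TABLE [dbo].[SalesByDate] SWITCH PARTITION 1 TO [dbo].[SalesByDate_Archive];

-- 2. Merge (collapse) the oldest boundary value.
ALTER PARTITION FUNCTION [SalesByDate_PartitionFunction]() MERGE RANGE ('2008-02-01');

-- 3. Designate the filegroup for the next partition, then split off the new month.
ALTER PARTITION SCHEME [SalesByDate_PartitionScheme] NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION [SalesByDate_PartitionFunction]() SPLIT RANGE ('2011-06-01');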

Now let’s submit a query to determine the total quantity of items sold during the most recent 7 days:

-- Script Q-1

SELECT SUM([Quantity])

FROM [SalesByDate]

WHERE [Date] BETWEEN GETDATE()-7 AND GETDATE()

Intuitively, you know that the query should probably access just one or, at the most, two partitions. However, if the query scans each partition to satisfy the date predicate (i.e., within the previous 7 days), it’s clear that the partition design is not being leveraged for improved query performance.

You can determine the number of partitions being accessed in a query by hovering over a scan or seek operation in the graphical execution plan (discussed earlier in this chapter) to see if it matches your expectations, as Figure 5-6 shows.







Figure 5-6: Partitions used in sample Seek operation

In this example, all 40 partitions were accessed, meaning that the optimizer did not take advantage of partition elimination based on the relationship between the date-based partition scheme and the date-related predicate. There are a variety of reasons for this behavior, including:



  • The use of certain functions and expressions

  • Implicitly converted data types within the WHERE clause

In the above example, the problem is with the data type (datetime) returned by the GETDATE() function. The following revised code should correct this issue by using the CONVERT() function to change the predicate comparison to a DATE data type. This modification will alter the plan to access just one or two partitions:

-- Script Q-2

SELECT SUM([Quantity])

FROM [SalesByDate]

WHERE [Date] BETWEEN CONVERT(DATE, GETDATE()-7)

AND CONVERT(DATE, GETDATE())

Hover over the Clustered Index Seek operation within the resulting query plan to verify the number of partitions that are accessed.

An alternative solution for this problem is to encapsulate the logic into a stored procedure and pass the dates as parameters that explicitly use the DATE data type, as follows:

-- Script Q-3

CREATE PROCEDURE [SumQuantityByDateRange] @begDate DATE, @endDate DATE

AS

SELECT SUM([Quantity])



FROM [SalesByDate]

WHERE [Date] BETWEEN @begDate AND @endDate

Example stored procedure execution:

-- Script Q-4

DECLARE @begDate date = GETDATE()-7

DECLARE @endDate date = GETDATE()

EXEC [dbo].[SumQuantityByDateRange] @begDate, @endDate

This approach enforces the proper data type usage for the partitioned table query.


Queries Indirectly Based on a Partitioned Column


For this example, we’ll use a table called Sales, which has 40 fixed partitions also based on month. However, in this case, the partition function uses a data type of smallint, which is derived from the following date-based expression:

(((DATEPART(yyyy,[Date])*12)+(DATEPART(mm,[Date])))%40)+1 AS [Partition]

This hash formula returns a number between 1 and 40, which is then included as a column in the fact table. The associated partition function in this scenario is as follows:

CREATE PARTITION FUNCTION [Sales_PartitionFunction](SMALLINT)

AS RANGE LEFT FOR VALUES

( 1, 2, 3, 4, 5, 6, 7, 8, 9,

10, 11, 12, 13, 14, 15, 16, 17, 18, 19,

20, 21, 22, 23, 24, 25, 26, 27, 28, 29,

30, 31, 32, 33, 34, 35, 36, 37, 38, 39 )

Note: This scheme is useful for adopting a monthly rotating window partitioning strategy where the partitions are static and eventually reused based on a date expression. Using this design, the sole maintenance requirement is to switch out the oldest partition (potentially, a very fast metadata-only operation) before the partition formula cycles back around to reuse a previously populated partition.
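To make that maintenance step concrete, the following is a minimal sketch of switching out the partition that the rotating formula is about to reuse. The staging table, which is assumed to be empty and partitioned identically to Sales, and the 39-month offset are illustrative assumptions, not part of the chapter scripts:

-- Hypothetical sketch: empty the partition that the rotating formula will reuse next month
DECLARE @reuseDate DATE = DATEADD(m, -39, GETDATE());
DECLARE @partitionValue SMALLINT =
    (((DATEPART(yyyy, @reuseDate) * 12) + DATEPART(mm, @reuseDate)) % 40) + 1;

ALTER TABLE [dbo].[Sales]
    SWITCH PARTITION $PARTITION.Sales_PartitionFunction(@partitionValue)
    TO [dbo].[Sales_Staging] PARTITION $PARTITION.Sales_PartitionFunction(@partitionValue);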

A consequence of having only an indirect relationship between the partitioned column and, in this case, a date-based query predicate is that the optimizer will not be able to automatically use partition elimination. As an example, let's look at the query we used earlier to access the minimum number of partitions, this time modified to use the Sales table with the above partition scheme:

-- Script Q-5

SELECT SUM([Quantity])

FROM [Sales]

WHERE [Date] BETWEEN CONVERT(DATE, GETDATE()-7)

AND CONVERT(DATE, GETDATE())

Viewing the execution plan for this query, you can see that the Clustered Index Seek operation accesses all 40 partitions, even though the data types in the predicate match. This is because the optimizer does not recognize the relationship between the date predicate and the derived partition column. To effectively tune this query, you will need to code the partition elimination directly by including the partition column in the query predicate.

To set the stage for this query modification, let’s take a closer look at the physical partitions for the Sales fact table using the following query:

-- Script Q-6

SELECT *

FROM sys.partitions

WHERE OBJECT_NAME(OBJECT_ID)='Sales'

ORDER BY [index_id], [partition_number]

The result set in Figure 5-7 details each table partition. Notice that column 4 lists the actual partition number, and column 6 lists the number of rows for each partition.



Figure 5-7: Sample results from sys.partitions

Next, you can include the $Partition function in the WHERE clause to view the contents of a single partition. For example, the following query lists only the rows contained in partition number 3:

-- Script Q-7

SELECT TOP 1000 *

FROM [dbo].[Sales]

WHERE $PARTITION.Sales_PartitionFunction([Partition]) = 3

To assist in routinely determining the correct partition number, you will probably want to create a scalar function based on the partition expression used earlier. This function makes it easy to return the proper partition number for routine queries:

-- Script Q-8

CREATE FUNCTION [dbo].[GetSalesPartition] (@date DATE)

RETURNS SMALLINT

AS

BEGIN


RETURN ((((DATEPART(yyyy,@date)*12)+(DATEPART(mm,@date)))%40)+1)

END


Leveraging these techniques, you can now alter the original query and benefit from partition elimination:

-- Script Q-9

SELECT SUM([Quantity])

FROM [Sales]

WHERE [Date] BETWEEN GETDATE()-7 AND GETDATE()

AND $PARTITION.Sales_PartitionFunction([Partition])

IN ( [dbo].[GetSalesPartition] ( GETDATE()-7 )

, [dbo].[GetSalesPartition] ( GETDATE() ) )

At this point, you may be wondering why not simply use the Partition column directly, instead of the more verbose $PARTITION.Sales_PartitionFunction([Partition]) function, as shown in the following query:

-- Script Q-10

SELECT SUM([Quantity])

FROM [Sales]

WHERE [Date] BETWEEN GETDATE()-7 AND GETDATE()

AND [Partition]

IN ( [dbo].[GetSalesPartition] ( GETDATE()-7 )

, [dbo].[GetSalesPartition] ( GETDATE() ) )

One reason, demonstrated by the above query, is that the optimizer misses partition elimination unless you use the $PARTITION function. Another important reason is that the value of the Partition column will not necessarily equal the physical partition number returned by the $PARTITION function. For example, if the partitioning strategy for this table used a partition function of RANGE RIGHT instead of RANGE LEFT, the actual value of the Partition column would be offset by 1 when compared to the number returned by the $PARTITION function.
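A quick way to see this relationship for yourself is to compare the stored column value with the physical partition number side by side. The following ad hoc query (not part of the chapter scripts) is one way to do so:

-- Compare the stored Partition column value with the physical partition number
SELECT DISTINCT [Partition]
     , $PARTITION.Sales_PartitionFunction([Partition]) AS [PhysicalPartitionNumber]
FROM [dbo].[Sales]
ORDER BY [Partition]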

As you can see, additional considerations are required when querying partitioned tables. Attention to the details of query plans, such as partition usage, can save many hours of run time and improve response times for all concurrently executing queries in your data warehouse environment.


Querying with Partitioned Indexes


In the same way that partitioned tables require additional query consideration, so do the indexes on these tables. Fortunately, unless you specify a different partition scheme or a separate filegroup, indexes created on partitioned tables will automatically be aligned with the underlying table partitions. Building on the earlier example, this means that if a table has 40 partitions, so will the accompanying indexes for that table. This is a clear benefit for data manipulation language (DML) operations, parallel execution plans, and partition maintenance (i.e., SWITCH, SPLIT, and MERGE), but it may also introduce query performance issues.

As an example, let’s compare the same aggregate query on two tables: the SalesByDate table, which is partitioned by date, and the SalesNP table, which is identical to the first table except that it is not partitioned. In the following query, our goal is to use a non-clustered, non-unique index (defined in both tables) that contains a single column (StoreID) in ascending order. The query will look for the total sale count for store number 50 during the previous month. Because we want to know the database read performance for this query, let’s begin by using the following SET command:

SET STATISTICS IO ON
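For reference, the single-column StoreID indexes used in this comparison might be defined along the following lines. This is a sketch only; the actual definitions are created by Script #2 and may differ:

-- Hypothetical definitions; Script #2 creates the actual indexes
CREATE NONCLUSTERED INDEX [IX_SalesNP_StoreID]
    ON [dbo].[SalesNP] ([StoreID] ASC);

CREATE NONCLUSTERED INDEX [IX_SalesByDate_StoreID]
    ON [dbo].[SalesByDate] ([StoreID] ASC); -- automatically aligned with the table's partitions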

Now run the following query on the non-partitioned table to get a benchmark of the read performance:

-- Script Q-11

SELECT COUNT(*) AS [SaleCount]

FROM [SalesNP]

WHERE [StoreID] = 50

AND [Date] BETWEEN DATEADD(m, DATEDIFF(m, 0, GETDATE()) - 1, 0)

AND DATEADD(m, DATEDIFF(m, 0, GETDATE()), 0) -1

The value returned for SaleCount will vary depending on the generated test data, but our interest here is in query cost and total disk I/O. The graphical query plan in Figure 5-8 shows a single non-clustered Index Seek operation with a 0.003 estimated subtree cost.



Figure 5-8: Cost of Index Seek operation

In this case, the index used is named IX_SalesNP_StoreID, as intended. If you click the Messages tab in SSMS, it reveals that only two logical reads were required to complete the query. This is impressive considering that the table itself has 100,000 rows. One reason for this favorable performance is that SQL Server includes the clustering key in secondary indexes defined on tables with a clustered index (recall that the tables in this example are clustered on the Date column).

The benefit of having the clustering key included in each row of the secondary indexes is that it expands the covering effect for queries based on the index, which is arguably one of the most effective performance tuning techniques available to developers. The only downside is the additional overhead of supplementing all secondary indexes with the clustering key.

Now that you have a good baseline for this query, you can run it again using the SalesByDate table, which is partitioned by date:

-- Script Q-12

SELECT COUNT(*) AS [SaleCount]

FROM [SalesByDate]

WHERE [StoreID] = 50

AND [Date] BETWEEN DATEADD(m, DATEDIFF(m, 0, GETDATE()) - 1, 0)

AND DATEADD(m, DATEDIFF(m, 0, GETDATE()), 0) -1

In this case, there are 66 logical reads and an estimated subtree cost of 0.131 for the same Index Seek operation. While it's true that 66 logical reads versus two is not a big concern on its own, it is 33 times greater than the baseline I/O performance of the same secondary index on a non-partitioned table. It's also worth noting that the more partitions a table has, the greater this problem will be. When extrapolated for a fact table containing billions of rows, this cost may be significant.

So let’s dive into what happened with the partitioned side of this query. The first thing to note when viewing the query plan, shown in Figure 5-9, is that it accessed 40 partitions of a secondary index called IX_SalesByDate_StoreID.





Figure 5-9: Index Seek (Non-Clustered) - Seek Predicates

At first glance, this appears to make sense because the secondary index on StoreID is partition aligned, thereby creating 40 distinct sections to this index. On the other hand, the partitions are aligned by the Date column, which is included in the query predicate. This should automatically invoke the optimizer’s partition-elimination capabilities.

A closer examination of the Seek Predicates in the operation above reveals that two Seek Keys are being used. The first, Seek Key [1], qualifies the specific partition, and the second, Seek Key [2], filters on both StoreID and Date. This operation, new in SQL Server 2008 and known as Skip Scan, provides two levels of search conditions in a single operation: one seek for the partition number and another seek for a secondary search condition. The Skip Scan operation is also an indication that the query may potentially access all partitions. In the case above, seven partitions are completely empty, so the 66 reads represent two reads for each of the 33 populated partitions.

To improve performance, similar to our earlier optimization with the clustered index, we need to explicitly convert the values returned by the embedded GETDATE() functions to the DATE data type, conforming to the data type of the partition function:

-- Script Q-13

SELECT COUNT(*) AS [SaleCount]

FROM [SalesByDate]

WHERE [StoreID] = 50

AND [Date] BETWEEN CONVERT(DATE, DATEADD(m, DATEDIFF(m, 0, GETDATE()) - 1, 0))

AND CONVERT(DATE, DATEADD(m, DATEDIFF(m, 0, GETDATE()), 0) -1)

This technique ensures that we achieve partition elimination and access only one partition of this secondary index, as illustrated in Figure 5-10.



Figure 5-10: Example of targeting a single index partition

This modification reduces the query overhead to just two reads, providing the same performance in this case as accessing a non-partitioned table index.


Queries on a Partitioned Heap


As its name implies, there is no specified order for the data stored in a heap. However, if the table participates in a partitioning scheme, you do have some degree of selectivity available, at least to the extent of the granularity defined by your partition function. With a total of 1,000 partitions available per table, you could choose to have nearly 3 years of data partitioned by day. This approach would partially simulate a clustered index on Date without the related index overhead.
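As a rough sketch of that idea, the following script generates 999 daily boundary values and creates a RANGE RIGHT partition function from them, yielding 1,000 daily partitions (roughly 2.7 years of data). The function name and start date are assumptions for illustration only:

-- Hypothetical sketch: build a daily partition function covering roughly 2.7 years
DECLARE @boundaries NVARCHAR(MAX) = N'';
DECLARE @day DATE = '2008-01-02';
DECLARE @i INT = 0;

WHILE @i < 999
BEGIN
    SET @boundaries += CASE WHEN @i = 0 THEN N'' ELSE N', ' END
                     + N'''' + CONVERT(NVARCHAR(10), @day, 120) + N'''';
    SET @day = DATEADD(d, 1, @day);
    SET @i += 1;
END

EXEC (N'CREATE PARTITION FUNCTION [SalesByDay_PartitionFunction](DATE)
        AS RANGE RIGHT FOR VALUES (' + @boundaries + N')');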

To help evaluate this option, let’s use another monthly partitioned table called SalesByDateHeap, which is the same as the SalesByDate table but without a clustered index. The secondary index based on StoreID remains on the heap. However, keep in mind that the Date column is not automatically included in the secondary index, as was the case with the SalesByDate clustered table used in the last example.

Because the following query is aligned perfectly with the monthly partition scheme, the omission of the Date column from the secondary index, IX_SalesByDateHeap_StoreID, has no effect on query performance:

-- Script Q-14

SELECT COUNT(*) AS [SaleCount]

FROM [SalesByDateHeap]

WHERE [StoreID] = 50

AND [Date] BETWEEN CONVERT(DATE, DATEADD(m, DATEDIFF(m, 0, GETDATE()) - 1, 0))

AND CONVERT(DATE, DATEADD(m, DATEDIFF(m, 0, GETDATE()), 0) -1)

The Messages tab in SSMS reports that this query took only two logical reads, providing the same benefit as the clustered version of this table. By examining the execution plan, which Figure 5-11 shows, you can also confirm that the secondary index (IX_SalesByDateHeap_StoreID) was used and that only one partition was accessed.





Figure 5-11: Index Seek (Non-Clustered)

It’s worth noting here, at least for this chapter’s modest set of test data (100,000 rows), that there was close to a 20% decrease in overall disk I/O when populating the SalesByDateHeap table compared with the disk I/O required to populate the SalesByDate clustered index table. This isn’t to imply that heaps are always faster or that you should avoid using a clustered index with partitioned tables. However, you may want to consider using a heap, along with supporting secondary indexes, especially if the granularity of the partitions partially imitates the ordered sequence of the clustered index. In the case of heap fragmentation (less likely with fact tables), SQL Server 2008 now also provides a method to rebuild heaps.
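A minimal example of that heap rebuild syntax, using this chapter's SalesByDateHeap table, looks like the following; the partition number shown is illustrative:

-- Rebuild the entire heap, or just a single partition (SQL Server 2008 and later)
ALTER TABLE [dbo].[SalesByDateHeap] REBUILD;
ALTER TABLE [dbo].[SalesByDateHeap] REBUILD PARTITION = 3;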


Covering Indexes


As mentioned earlier, covering indexes are a proven technique for boosting SQL query performance, and this holds true in both OLTP and data warehouse environments. As a quick review, a covering index is an index that contains all of the columns from an underlying table that are needed to satisfy all or part of a query. The primary reasons for the performance improvement with covering indexes are:

  • Row look-ups for additional columns to the primary table are avoided

  • The order and selectivity of the secondary index may benefit the query

  • The index row width is usually much narrower than the actual table (clustered or heap), resulting in fewer database reads

Include Column Option

Beginning with SQL Server 2005, secondary indexes can take advantage of the INCLUDE clause to add non-key columns to the leaf level of an index. The good news is that this technique increases the likelihood that your index will cover a query, and it has less maintenance overhead than appending the column(s) as a larger, composite key. It also allows LOB data, such as varchar(max), to be included in an index. On the downside, this approach increases the width of the index, resulting in fewer index rows per database read.

Taken to the extreme, a covering index could consume the same disk space and carry the same overhead as the table itself. In a data warehouse environment, this could be significant. Consequently, you shouldn't add columns to an index without first evaluating the proportion and relative cost of queries that would actually benefit from the broader index.
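As a brief illustration, the following hypothetical index on the SalesByDate table keys on StoreID and carries Quantity and Item only at the leaf level; a query that filters on StoreID and returns those two columns could then be satisfied entirely from the index. The index name and column choice are assumptions and are not part of the chapter scripts:

-- Hypothetical covering index using the INCLUDE clause
CREATE NONCLUSTERED INDEX [IX_SalesByDate_StoreID_Covering]
    ON [dbo].[SalesByDate] ([StoreID])
    INCLUDE ([Quantity], [Item]);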

Filtered Indexes


SQL Server 2008 has introduced a new filtered index feature that can dramatically optimize queries that have an aligned predicate. For example, let’s say that a sales quantity of less than zero indicates a product return, and let’s further assume that there are several common queries used to manage these returns. In this scenario, we can anticipate that queries may often have the predicate of ‘[Quantity] < 0’. To understand the potential query impact, let’s start by benchmarking two queries: one based on a clustered index and one based on a heap:

-- Script Q-15

SELECT COUNT(*) AS [ReturnCount]

FROM [SalesByDate]

WHERE [Quantity] < 0

SELECT COUNT(*) AS [ReturnCount]

FROM [SalesByDateHeap]

WHERE [Quantity] < 0

Using feedback from STATISTICS IO, you can see that the first query, based on a clustered index, took 501 reads and the second query, based on a heap, took 363 reads (SQL Server Profiler reported 501 and 427, respectively).

Now let's add to both tables a partition-aligned filtered index that uses the predicate '[Quantity] < 0':

-- Script Q-16

CREATE NONCLUSTERED INDEX [FIX_SalesByDate_StoreID]

ON [dbo].[SalesByDate] ([StoreID])

WHERE [Quantity] < 0


CREATE NONCLUSTERED INDEX [FIX_SalesByDateHeap_StoreID]

ON [dbo].[SalesByDateHeap] ([StoreID], [Quantity])

WHERE [Quantity] < 0

Now let's rerun the original two queries and see whether there is an associated performance benefit. With the filtered indexes in place, both of the queries above result in a total of 99 logical reads, or up to an 80% improvement in disk I/O. Another benefit of the filtered index is that, in this case, it occupies only about 22% of the space of a similar non-filtered index (i.e., one without the WHERE clause).



Note: The Quantity column was added to the filtered index for the heap. This is a workaround because the optimizer bypasses this filtered index on a heap when this column is excluded.

Filtered indexes are also beneficial on columns with a high percentage of NULL values, letting you exclude those rows from the index. This, of course, needs to align with queries that filter on the same column and also exclude NULLs, as in the following sketch.
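Here, [PromotionCode] is a hypothetical, mostly-NULL column used purely for illustration; it is not part of the sample DWToolkit schema:

-- Hypothetical filtered index that stores only the non-NULL rows
CREATE NONCLUSTERED INDEX [FIX_SalesByDate_PromotionCode]
    ON [dbo].[SalesByDate] ([PromotionCode])
    WHERE [PromotionCode] IS NOT NULL;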


Indexed Views


As mentioned earlier, join costs tend to be very high in a data warehouse, which is why this expense is often offset by using a de-normalized schema design. Another way to avoid this cost, especially if the underlying tables are relatively static, is to use indexed views. This approach materializes the result set of a view that:

  • Uses SCHEMABINDING and

  • Is supplemented by a unique clustered index

This technique also works well for views that aggregate data because the materialized storage overhead is typically much smaller than that for the original base tables.

Note: When SCHEMABINDING is specified, the base table(s) cannot be changed in a way that would alter the definition of the indexed view.

To see an indexed view in action, let’s benchmark a query that performs an aggregation after inner-joining two tables together:

-- Script Q-17

SELECT DATEPART(yyyy,[Date]) AS [Year]

, SUM([ListPrice]) AS [TotalListPrice]

FROM [dbo].[SalesByDate] s

JOIN [dbo].[Product] p

ON s.[Item] = p.[Item]

GROUP BY DATEPART(yyyy,[Date])

According to Profiler, this query runs against our limited test data in 0.3 seconds and takes 746 logical reads. You can improve this performance by first creating an indexed view based on this same query:

-- Script Q-18

CREATE VIEW [dbo].[IV_TotalListByYear] WITH SCHEMABINDING AS

SELECT DATEPART(yyyy,[Date]) AS [Year]

, SUM([ListPrice]) AS [TotalListPrice]

, COUNT_BIG(*) AS [CountBig] -- required to index the view

FROM [dbo].[SalesByDate] s

JOIN [dbo].[Product] p

ON s.[Item] = p.[Item]

GROUP BY DATEPART(yyyy,[Date])

You can then add a unique clustered index on this schema-bound view:

-- Script Q-19

CREATE UNIQUE CLUSTERED INDEX CIX_TotalListByYear

ON [dbo].[IV_TotalListByYear] ([Year])

Now the cost of running the original “join and aggregate” query, or alternatively the cost of simply selecting directly from the new indexed view, is reduced to only two reads and zero measured duration. Obviously, that's a big improvement, but keep in mind there is a cost to creating and maintaining indexed views. The more static your data (which may be the case during the day for some data warehouses), the more practical it is to use indexed views. On the other hand, constant updates to the underlying base tables may preclude the use of this method for improving query performance.



Note: If you are not using Enterprise, Data Center or Developer editions of SQL Server, there are two caveats to consider when using indexed views:

  1. The query hint NOEXPAND must be included to ensure that the indexed view is not expanded to the underlying tables by the query optimizer.

  2. The indexed view must be directly referenced by name in the query. If it is not, the optimizer will not consider using the indexed view, even for a query like the one demonstrated above (see the sketch following this note).
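On those editions, a query against the indexed view might look like the following sketch:

-- Reference the indexed view by name and add the NOEXPAND hint
SELECT [Year], [TotalListPrice]
FROM [dbo].[IV_TotalListByYear] WITH (NOEXPAND)
ORDER BY [Year]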

Routine Aggregation Using Summary Tables in the Data Warehouse


A strategy related to indexed views is the creation of summary tables to feed reports or Key Performance Indicators (KPIs) that are routinely displayed, perhaps on dashboards or organizational Web sites. This process would typically be included as a late step of a daily ETL process and/or updated on a scheduled basis.

Of course, the use of summary tables will not provide the same currency as live data, but in a data warehouse environment, you typically have some built-in scheduled latency due to the timing of various data feeds and ETL packages. To provide clarity on the timeliness of downstream reporting, you can add a refreshed date/time column to a summary table, which could then be used to qualify your reports, KPIs, and other data visualizations.
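A minimal sketch of such a summary table, rebuilt at the end of a daily ETL run, might look like the following; the table name, the yearly grain, and the RefreshedDateTime column are illustrative assumptions:

-- Hypothetical summary table rebuilt by a scheduled ETL step
IF OBJECT_ID('dbo.SalesSummaryByYear') IS NOT NULL
    DROP TABLE [dbo].[SalesSummaryByYear];

SELECT DATEPART(yyyy, [Date]) AS [Year]
     , SUM([Quantity])        AS [TotalQuantity]
     , GETDATE()              AS [RefreshedDateTime] -- used to qualify downstream reports
INTO [dbo].[SalesSummaryByYear]
FROM [dbo].[SalesByDate]
GROUP BY DATEPART(yyyy, [Date])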


Conclusion


De-normalization, table partitioning, and a variety of indexing tactics can all help significantly improve query performance in a data warehouse. Table partitioning can also streamline bulk maintenance tasks such as switching out old partitions via fast metadata-only operations. However, because these techniques may require additional overhead, either for the database developer (object creation and maintenance) or for SQL Server itself, take care not to overemploy these strategies.

One way to determine the appropriate use of performance tuning techniques is to monitor your SQL Server environment for spikes in server resource utilization and correlate this information with recurring query patterns. We'll cover this topic in the next section.



