Chapter 5: Querying, Monitoring, and Performance Tuning




Performance Tuning

Overview


After ensuring accuracy and reliability, performance is the most critical objective for queries running in a data warehouse. Due to very large data volumes and unpredictable query patterns, it is common for performance complaints to surface in the data warehouse server environment.

Because we like easy solutions, a common response to performance issues involves acquiring a larger server. While this usually does provide faster response times, it’s not the most economical or greenest solution. It also tends to mask core issues with the database schema, indexing strategy, reusable database objects, and/or query design—all problems that tend to resurface even after procuring bigger hardware. Additionally, the expense and migration headache associated with server and SAN upgrades can be considerable, especially when code and schema optimization can often bring dramatic improvements in query performance.

The concepts and techniques presented in this section are geared more toward query optimization than overall server optimization. The emphasis is on identifying recurring problematic SQL code and applying a variety of query tuning techniques. See Chapter 4 for recommendations about overall server-optimization practices.

Targeted Performance Tuning


A fairly high percentage of SQL code submitted for execution to a database server could be optimized, at least to some degree. This doesn’t mean, however, that it makes sense to performance-tune all code that is developed for the database. Shaving a mere 4 milliseconds off a query that runs once a day is an obvious waste of time. However, taking 30 seconds off a stored procedure that runs 200 times a day is significant, both in terms of server capacity management and user wait times.

The best way to ensure that you have targeted the most problematic code is to base your efforts on the analysis from your monitoring applications, as discussed in the last section. The objective is to identify SQL code that is repeatedly run, either as a stored procedure or a batch script, and that consumes the most server resources. This approach should provide the highest return on investment (ROI) for your tuning efforts.

The three most common measures to examine for this exercise are:


  • Disk I/O

  • CPU Time

  • Run Times (Duration)

Disk I/O and CPU Time are metrics that can be directly attributed to a single query. Duration, on the other hand, may indicate a problem with a given query or perhaps a more complex problem involving resource contention on the database server. In a data warehouse environment, it’s not unusual for there to be tremendous pressure on Disk I/O simply due to the volume of data needed for ad hoc and deployed queries. A good practice is to use one of the monitoring techniques discussed in the last section and look for recurring queries with a high number of database reads (or overall disk I/O).

The CPU Time measure in a data warehouse often parallels the Disk I/O measure in that a reduction in reads will typically have a corresponding drop in CPU. However, certain techniques have an inverse relationship involving these two measures. An example of this is table compression, where you would most likely have a positive effect on Disk I/O with more data per page but a negative effect on CPU due to handling the compression routine. Another sometimes counterintuitive scenario is when you upgrade to faster disk storage and then find your system has an increase in CPU usage. This of course is due to faster I/O, and you should still see a net gain unless your CPUs were already under great pressure.
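To quantify the compression side of this trade-off before committing to it, you can ask SQL Server to estimate the space savings first. The following is a minimal sketch using the sp_estimate_data_compression_savings system procedure (available in SQL Server 2008 and later); the target table here is purely illustrative.

-- Estimate the space saved by PAGE compression before paying the CPU cost
EXEC sp_estimate_data_compression_savings
      @schema_name = 'snapshots'
    , @object_name = 'trace_data'
    , @index_id = NULL
    , @partition_number = NULL
    , @data_compression = 'PAGE'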

In the situation where a trade-off exists, you should consider overall server performance counters (available in MDW reporting), and reduce demand for whichever resource is usually under the greatest pressure.

Query run times should of course be monitored and targeted for performance tuning, but with an eye toward concurrently running queries. For example, a large UPDATE statement causing an extended run time for one query may reflect poorly on the duration of another query that tries to concurrently SELECT from the same table. Diagnosing the second query outside of the context of the first would likely hide the real problem of locking and blocking. In addition, heavy activity due to one or more very large queries (deployed or ad hoc) can make all queries appear to run poorly.



Note: One way to spot this type of locking and blocking is to query the [snapshots].[trace_data] table, looking for a large number of queries that have a similar EndTime—perhaps within 500 milliseconds of each other. A close-to-common end time often indicates that a large blocking query finally completed and the corresponding blocked queries were then able to quickly complete.
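The following is a minimal sketch of that approach, assuming the standard MDW trace_data columns. It buckets completion times to the whole second (a coarse approximation of the 500-millisecond window) and surfaces any suspiciously large clusters; the threshold of 10 queries per bucket is arbitrary.

-- Find clusters of queries that completed at nearly the same time
SELECT CONVERT(VARCHAR(19), [EndTime], 120) AS [EndTimeBucket] -- yyyy-mm-dd hh:mi:ss
, COUNT(*) AS [QueriesCompleted]
FROM [snapshots].[trace_data]
GROUP BY CONVERT(VARCHAR(19), [EndTime], 120)
HAVING COUNT(*) > 10
ORDER BY [QueriesCompleted] DESC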

Query Tuning Based on Disk I/O


To demonstrate examples of performance tuning, let’s take advantage of the MDW, which, if configured as illustrated in the previous section, should contain a substantial amount of performance-related data for the instances being tracked. Our first target will be SQL code with a large number of (logical) reads as recorded by the SQL Trace data collector. To determine this, we will group the SQL text by the first 80 characters as a simple (albeit imperfect) attempt to spot recurring dynamic SQL code:

-- Script PT-1
SELECT TOP 10 LEFT([TextData],80) AS [TextShort]
, COUNT(*) AS [TimesRun]
, AVG([Reads]) AS [AvgReads]
, SUM([Reads]) AS [SumReads]
, MIN([Reads]) AS [MinReads]
, MAX([Reads]) AS [MaxReads]
FROM [snapshots].[trace_data]
GROUP BY LEFT([TextData],80)
HAVING COUNT(*) > 4
ORDER BY [SumReads] DESC



Note: You may want to vary the number of characters used with the LEFT character function to better match your most common T-SQL query patterns.

The above code should help identify the most expensive queries based on database reads. If you need to see the full TextData column for the abbreviated T-SQL code (i.e., text beyond 80 characters), you can re-query the trace_data table with a predicate based on the returned MinReads and/or MaxReads values. This method is required because there is no unique row identifier in the [snapshots].[trace_data] table.
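For example, the following hedged sketch retrieves the full statement text for one of the abbreviated rows returned above. The Reads value in the predicate is a placeholder for the MinReads or MaxReads value you copied from the previous result set, and the column list assumes the standard trace columns.

-- Retrieve the full TextData for a query identified by its Reads value
SELECT [TextData], [Reads], [StartTime], [EndTime]
FROM [snapshots].[trace_data]
WHERE [Reads] = 1234567 -- placeholder: substitute a MinReads/MaxReads value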

To continue with tuning based on disk I/O, the query above can double as a query that itself needs to be tuned, since it takes a fairly long time to run. As in earlier examples, you want to get a benchmark of the query targeted for tuning so that you can track your optimization progress. By specifying the following SET options, you will be able to see the needed query feedback metrics under the Messages tab in SSMS:

SET STATISTICS IO ON
SET STATISTICS TIME ON

Also make sure you right-click in the design surface and choose Include Actual Execution Plan before rerunning the query, as Figure 5-29 shows, so that you see a graphical view of the query plan.





Figure 5-29: Including Actual Execution Plan in SSMS

After rerunning the query above, you can refer to the statistical feedback in the SSMS Messages tab (shown in Figure 5-30) to see that the query used a total of about 440,000 reads, with a CPU time of roughly 25,000 ms (milliseconds).





Figure 5-30: Results Pane - Query Messages

By clicking the Execution plan tab in SSMS, you can also see a graphical representation of the query plan, which Figure 5-31 shows.





Figure 5-31: Graphical execution plan in SSMS

Note: Later in this section, we’ll look at other options for obtaining details about execution plans. SSMS also lets you view an estimated execution plan (using the same drop-down menu as displayed above), which provides a very similar plan without the need to run the actual query.

By scanning the above execution plan from right to left, you can see a steady decrease in arrow width between operations. This indicates a gradually reduced number of rows handled from step to step. The right-most operation, a Clustered Index Scan, also happens to carry the largest operational cost, which is 98% of the overall query.

You can hover over the icons in the execution plan to obtain more details about the operation. For example, hovering over the Clustered Index Scan icon displays the metrics for the operation, as Figure 5-32 shows.



Figure 5-32: Details for Clustered Index Scan operation

In addition, hovering over an arrow in the execution plan provides the number of rows going into or out of a given step, as Figure 5-33 shows. Note how the arrows grow wider as the rows involved in the specific operation increase.





Figure 5-33: Number of rows between operations

The best place to start when considering strategies for optimization is to focus on operations with a high percentage cost as well as wide arrows between query steps. The greatest expense by operation in this case, as with many data warehouse queries, is in scanning a large table or index. With this particular query, most of the cost occurs with scanning the clustered index (IDX_trace_data_EventSequence) for the specified query columns. Because this operation carries virtually all the cost of this query (98%), you would need to address this trace_data clustered scan operation to have any real effect on the performance of this query.



Optimization Strategy

There are several options to consider for optimizing a large clustered index scan.



However, before pursuing a specific option, make sure you consider the objectives of the query itself. In some cases, clarifying the purpose of a targeted query may provide additional tuning opportunities. For example, in this case, you are looking for queries in need of performance tuning from an I/O (read) perspective. However, a casual browsing of the trace_data table shows that most rows have a very small value in the Reads column. These rows have no meaningful consequence to the original intent of this query, which is to find recurring queries with a very high amount of I/O.

This insight into the query objective suggests two potential refinements to the optimization strategy:



  1. A new filtered index where the Reads column is greater than a certain value

  2. A new or modified index with the Reads column as the first column

Based on the data distribution of the Reads column, a filtered index is probably the best choice, especially if you set the filter threshold fairly high. This will ensure that the index will always have far fewer rows than a comparable non-filtered index. Another decision to make before creating the filtered index is to determine which columns to house within the index, either as a part of a composite index that begins with the Reads column or simply as an included column for the index.

Another key consideration for the filtered index is the Reads threshold a row must exceed in order to be included in the index. The lower the threshold, the more rows from the base trace_data table will be placed in the index. A lower threshold also means more pages required for facilitating this index as well as greater overall maintenance costs. Fortunately for our purpose, we are looking for queries with an exceptionally large number of reads, which argues for a higher threshold. This will also translate to fewer index rows, providing higher selectivity when using the index and lower maintenance costs. For this exercise, let’s restrict the new filtered index to include only those rows with more than 50,000 reads for the queries being traced.

For the query in question, the only two columns you need to make this a covering index are Reads and TextData. Because TextData has a data type of NVARCHAR(MAX), it cannot be used as a key column in a composite index. It can, however, be listed in the INCLUDE clause, making the index covering for queries that reference the TextData column:

-- Script PT-2
CREATE NONCLUSTERED INDEX [FIDX_trace_data_Reads_incl_TextData]
ON [snapshots].[trace_data]
( [Reads] ) INCLUDE ( [TextData] )
WHERE ([Reads] > 50000)

This index can now be automatically leveraged for any queries on the trace_data table with a predicate based on Reads greater than 50,000. Once you add this predicate to the original query, it’s almost guaranteed to reduce the number of returned rows.

Note: With the exception of MaxReads, the aggregate values will probably differ because the query will likely be based on a much smaller dataset. Despite this, we are still gaining the most actionable data from the query and staying true to our intent of finding queries that have a very high number of database reads and, therefore, a high impact on disk I/O.

Now let’s rerun the query, being sure to align the WHERE clause with our new filtered index of Reads > 50000:

-- Script PT-3
SELECT TOP 10 LEFT([TextData],80) AS [TextDataShort]
, COUNT(*) AS [TimesRun]
, AVG([Reads]) AS [AvgReads]
, SUM([Reads]) AS [SumReads]
, MIN([Reads]) AS [MinReads]
, MAX([Reads]) AS [MaxReads]
FROM [snapshots].[trace_data]
WHERE ([Reads] > 50000) -- added to align with the filtered index
GROUP BY LEFT([TextData],80)
HAVING COUNT(*) > 4
ORDER BY [SumReads] DESC

This revised query now requires 85 reads (compared to 440,000) and takes only 31 ms of CPU time (compared to 25,000 ms). This represents a major performance improvement, with the only additional overhead being the maintenance of the new filtered index, which should hold a small fraction of the total rows in the trace_data table.

To see how many rows are included in the new filtered index, you can use the following query:

-- Script PT-4

SELECT COUNT(*) FROM [snapshots].[trace_data] WHERE ([Reads] > 50000)

Ideally, this number would be less than 1,000 because you are looking for worst-case queries. You will want to adjust the filtered index threshold placed on Reads up or down from 50,000 to best fit the size of your large data warehouse queries. For large databases, you may find 500,000 is a better number to use in the filtered index. In any case, the objective is to keep enough rows to obtain actionable information while adding the least amount of database overhead.
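If you do raise the threshold, the change is simply a matter of dropping and recreating the filtered index, as in the following sketch (the 500,000 value is illustrative, and the index name matches Script PT-2). Remember to align the WHERE clause of any dependent queries, such as Script PT-3, with the new threshold.

-- Recreate the filtered index with a higher Reads threshold
DROP INDEX [FIDX_trace_data_Reads_incl_TextData] ON [snapshots].[trace_data]

CREATE NONCLUSTERED INDEX [FIDX_trace_data_Reads_incl_TextData]
ON [snapshots].[trace_data]
( [Reads] ) INCLUDE ( [TextData] )
WHERE ([Reads] > 500000)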

Working with Execution Plans


While you are optimizing queries, you may find it beneficial to store query plans for later reference. For example, it may be helpful to compare execution plans before and after optimizing a query. This may help you become more familiar with the operations SQL Server uses and how they can affect performance. For problematic SQL code, it’s also handy to have a well-performing query plan stored. Then, in the event that the same query starts to run poorly, you can see what has changed from a query plan perspective. Figure 5-34 shows a sample graphical query plan, as we’ve seen earlier in this chapter.



Figure 5-34: Graphical query plan

Execution plans are also available in XML and other text modes. You can use the following SET options in SSMS to obtain text-based versions of execution plans (a brief usage sketch follows the list):



  • SET SHOWPLAN_XML ON

  • SET SHOWPLAN_TEXT ON

  • SET SHOWPLAN_ALL ON
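The following is a minimal usage sketch for the first of these options. When SHOWPLAN_XML is ON, SQL Server returns the estimated plan as XML instead of executing the statement, so remember to turn it back OFF afterward; the SET statement must also be the only statement in its batch, hence the GO separators.

SET SHOWPLAN_XML ON
GO
SELECT COUNT(*) FROM [snapshots].[trace_data] WHERE ([Reads] > 50000)
GO
SET SHOWPLAN_XML OFF
GO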

Figure 5-35 shows a limited screen shot of an XML-based query plan.



Figure 5-35: Text-based (XML) query plan

For more information about these and other Showplan SET options, see Displaying Execution Plans by Using the Showplan SET Options (Transact-SQL).

SSMS also has a nice feature to let you save a graphical execution plan in the XML file format. You are able to do this while viewing a plan graphically under the estimated or actual Execution plan tab. The saved XML file can later be reloaded into SSMS and displayed graphically as it was originally rendered.

Note: You can also collect and store SQL Plan data using Profiler. For more information see: http://blogs.techrepublic.com.com/datacenter/?p=269

To store an execution plan in XML format, simply right-click anywhere on the graphical display pane and provide a name and location for the .sqlplan file, as Figure 5-36 shows.





Figure 5-36: Saving an XML-formatted execution plan

Because these files reside in the file system, you have the option to query or transform these plans via a variety of XML techniques, including XPath, XQuery, and XSLT.
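As an illustration, the following hedged sketch loads a saved plan from disk and uses XQuery to pull out the first physical operator. The file path is purely illustrative, the namespace is the standard showplan namespace, and the OPENROWSET(BULK ...) call requires permission to perform bulk operations.

-- Read a saved .sqlplan file and extract the first physical operator
;WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT p.plan_xml.value('(//RelOp/@PhysicalOp)[1]', 'NVARCHAR(128)') AS [FirstPhysicalOp]
FROM (
    SELECT CAST(BulkColumn AS XML) AS plan_xml
    FROM OPENROWSET(BULK N'C:\Plans\PT-1_before.sqlplan', SINGLE_BLOB) AS f
) AS p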

For more details about XML Showplans in general, see XML Showplans.

Join Operators


The join operator is a key element of most query plans. There are three primary join operator types that SQL Server may choose to use when joining together tables:

  • Nested Loops

  • Merge

  • Hash Match

Using complex mathematical algorithms based on schema design, actual data, keys, indexes, and metadata, the SQL Server optimizer will normally choose the best join operator to use in a query plan. However, the optimizer is not perfect in making this determination, and in the event of a performance problem, you may want to identify the join operator chosen for a given query step and know how to alter it if necessary.

Nested Loops Join Operator

To get started with our look at the Nested Loops join operator, be sure your SET STATISTICS statements are on:

SET STATISTICS IO ON
SET STATISTICS TIME ON

And also make sure to include the actual execution plan for the current connection, as Figure 5-37 shows.



Figure 5-37: Including the query execution plan

You’re now ready to run the following query, which illustrates a join operator:

-- Script PT-5
SELECT pci.performance_counter_id
, pcv.collection_time
, pci.counter_name
, pci.instance_name
, pcv.formatted_value
FROM snapshots.performance_counter_instances pci
INNER JOIN snapshots.performance_counter_values pcv
ON pci.performance_counter_id = pcv.performance_counter_instance_id
WHERE pci.[performance_counter_id] = 2 -- Avg. Disk Queue Length

The results of this query show the values that have been captured in the MDW for the system performance counter of Avg. Disk Queue Length. We will refine this query later to be a little more helpful, but for now, let’s stay on track and examine the join operator. In the results pane, click the Execution Plan tab to see the join operator selected by the optimizer, as Figure 5-38 shows.



Figure 5-38: Nested Loops join operator

In this case, you can see that the optimizer chose the Nested Loops join operator. This operator is driven first by a top or outer input of qualifying rows, which in this example is just one row:

WHERE pci.[performance_counter_id] = 2

It is then followed by a bottom or inner scan of matching rows to satisfy the join predicate:

INNER JOIN snapshots.performance_counter_values pcv
ON pci.performance_counter_id = pcv.performance_counter_instance_id

Because each qualifying row in the outer table requires an inner scan, the Nested Loops join works best with a small number of rows in the outer, or top, loop. Now let’s try another query—one that pulls back two performance counters:

-- Script PT-6
SELECT pci.performance_counter_id
, pcv.collection_time
, pci.counter_name
, pci.instance_name
, pcv.formatted_value
FROM snapshots.performance_counter_instances pci
INNER JOIN snapshots.performance_counter_values pcv
ON pci.performance_counter_id = pcv.performance_counter_instance_id
WHERE pci.[performance_counter_id] IN (3,7) -- Avg. Read and Write Queue

The plan for this query should still use the Nested Loops join operator. However, your results could vary because this query appears to border on using the Merge join operator. For purposes of this exercise, you can force the Nested Loops operator (if needed) by including the following hint inside the INNER JOIN clause:

INNER LOOP JOIN snapshots.performance_counter_values pcv

The updated query execution plan should also show the Nested Loops join operator. However, in this case, as Figure 5-39 shows, this join operator has a higher overall operation percentage cost (45%) than it did in the previous query, which was 34%. This result indicates that as the number of rows in the outer loop increases, the Nested Loops operator may become less efficient due to the additional inner (bottom) loops that are required.





Figure 5-39: Nested Loops join operator (continued)

To verify the number of inner loops, you can also look on the Messages tab in SSMS to see that two scans are now required to satisfy the join operation:

Table 'performance_counter_values'. Scan count 2, logical reads 61, physical reads 0 …
Table 'performance_counter_instances'. Scan count 2, logical reads 4, physical reads 0 …



Merge Join Operator

To demonstrate the Merge Join operator, let’s alter the query slightly, this time looking for all performance counters with an ID less than 5:

-- Script PT-7
SELECT pci.performance_counter_id
, pcv.collection_time
, pci.counter_name
, pci.instance_name
, pcv.formatted_value
FROM snapshots.performance_counter_instances pci
INNER JOIN snapshots.performance_counter_values pcv
ON pci.performance_counter_id = pcv.performance_counter_instance_id
WHERE pci.[performance_counter_id] < 5

When you examine the new execution plan, you’ll see that the optimizer has switched to the Merge Join operator, as Figure 5-40 shows.



Figure 5-40: Merge Join operator

If your query plan results differ from the above, you can force the Merge Join operator by including the following hint in the INNER JOIN clause:

INNER MERGE JOIN snapshots.performance_counter_values pcv

Unlike the Nested Loops join operator, where the inner table may need to be scanned multiple times, the Merge Join operator scans data just once. This is clearly an advantage for the Merge Join operator, but it also requires that the tables involved be sorted by the join predicate. On closer examination, you can see that these tables both have a clustered index that begins with the columns named in the JOIN clause:

FROM snapshots.performance_counter_instances pci
JOIN snapshots.performance_counter_values pcv
ON pci.performance_counter_id = pcv.performance_counter_instance_id

This makes the Merge Join a natural choice for this particular query.



Note: The optimizer may choose to first create a worktable that is sorted based on the join predicate in order to take advantage of the Merge Join operator.

Because the sample queries above operate very close to the optimizer’s threshold between the Nested Loops and Merge Join operators, you shouldn’t see much difference in the I/O or CPU costs by tweaking these queries with a table join hint. The important thing is to understand the characteristics of the join type being used and how to override the join choice, if necessary.



Hash Match Join Operator

To demonstrate the Hash Match join operator, we will change our earlier query by substituting a table named [core].[snapshots_internal]. With this code revision, the optimizer will not have a convenient clustered or secondary index to use when considering the join predicate for this query:

-- Script PT-8
SELECT pcv.collection_time
, pcv.formatted_value
, si.snapshot_time_id
FROM [core].[snapshots_internal] si
INNER JOIN snapshots.performance_counter_values pcv
ON si.[snapshot_id] = pcv.[snapshot_id]
WHERE pcv.performance_counter_instance_id < 5

After running this query, you should see that the execution plan contains the Hash Match join operator, shown in Figure 5-41.



Figure 5-41: Hash Match join operator

For this join exercise, if your plan does not reflect the Hash Match operator, you can force it by including the following hint in the JOIN clause:

INNER HASH JOIN snapshots.performance_counter_values pcv

The Hash Match join operator is the most complex of the three join operators and works based on a hash function that is applied to each row in both the outer (top) and inner (bottom) inputs. For the very large tables in a data warehouse, the Hash Match operator may put pressure on the CPU to perform the hash function and may place considerable pressure on memory, depending on the cardinality of the hash output. This is because all the distinct hash values must be held in memory or paged to tempdb while the Hash Match operation completes.

The good news for the Hash Match operator is that the tables are scanned only once, providing a similar benefit to the Merge Join operator.

For a helpful article on the internals of the Hash Match join operator, see Physical Join Operators in SQL Server - Hash Operator.

As noted earlier, the SQL Server optimizer normally selects the most efficient join operator. However, if you have a query that is in need of attention due to performance issues, it’s helpful to verify that the query plan is using the best join method for the operation. If the optimizer chooses a less-efficient join operator, you may decide to supplement the query with a table join hint. If you choose this approach, make sure you document where you have embedded query hints. This will allow you to more easily reexamine the performance effects of these supplemental hints with the release of new versions of SQL Server.

Now that you’re familiar with more of the tables maintained in the MDW database, you may find the next query helpful in trending one or more of the performance counters that are of interest. The following example provides a daily trend for the system performance counter of Average Disk Queue Length:

-- Script PT-9
SELECT pci.performance_counter_id
, CAST(pcv.collection_time AS DATE) AS [Date]
, pci.counter_name
, pci.instance_name
, AVG(pcv.formatted_value) AS [Avg Value]
FROM snapshots.performance_counter_instances pci
INNER JOIN snapshots.performance_counter_values pcv
ON pci.performance_counter_id = pcv.performance_counter_instance_id
WHERE pci.[performance_counter_id] = 2 -- Avg. Disk Queue Length
GROUP BY pci.performance_counter_id
, CAST(pcv.collection_time AS DATE)
, pci.counter_name
, pci.instance_name
ORDER BY pci.performance_counter_id, [Date]

You may want to enhance this query to filter out rows for non-peak times in your data warehouse, as the sketch below illustrates. Incidentally, according to the execution plan for this last query, the optimizer chose the Nested Loops join operator.
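The following is one hedged way to do that: the same daily trend restricted to a peak window. The 8:00-to-18:00 hour range is an assumption you would adjust to your own workload.

-- Daily Avg. Disk Queue Length trend restricted to assumed peak hours (08:00-17:59)
SELECT CAST(pcv.collection_time AS DATE) AS [Date]
, AVG(pcv.formatted_value) AS [Avg Value]
FROM snapshots.performance_counter_instances pci
INNER JOIN snapshots.performance_counter_values pcv
ON pci.performance_counter_id = pcv.performance_counter_instance_id
WHERE pci.[performance_counter_id] = 2 -- Avg. Disk Queue Length
AND DATEPART(HOUR, pcv.collection_time) BETWEEN 8 AND 17
GROUP BY CAST(pcv.collection_time AS DATE)
ORDER BY [Date]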


Isolation Level Considerations


SQL Server has these basic transaction isolation levels, listed below from lowest to highest:

  1. Read Uncommitted

  2. Read Committed (default level), and Read Committed Snapshot

  3. Repeatable Read

  4. Serializable

The two highest levels, Repeatable Read and Serializable, are typically used for more sensitive operational database (OLTP) queries, where data integrity is a greater concern than query performance or user concurrency. Conversely, many data warehouse applications are able to take advantage of a more liberal, or lower-level, isolation strategy. This is because:

  • Data warehouses are typically just copies of underlying operational source systems, where critical transactional consistency is maintained.

  • Daily updates to the data warehouse are often batched together within ETL packages that are sequenced to run before the reporting processes and queries.

  • Many queries that are run in a data warehouse feed aggregate-level reports that are used more for rendering KPIs, sparklines, charts, and other trends, which are not necessarily sensitive to small anomalies in the underlying data.

The Read Committed isolation level is the default used by the SQL Server Database Engine. This level ensures that any pages being altered in the database are committed before they can be read by another process. This isolation level can block a database read operation while waiting for a database update process to commit. This blocking behavior is a reasonable default but, at the same time, may be too pessimistic for certain data warehouse queries. SQL Server provides mechanisms to override this behavior, which we’ll discuss next.

Because Read Committed is the default isolation level, no explicit coding is required when querying directly on SQL Server. A variation of this default is Read Committed Snapshot, which uses the most recent committed version of the data when there is lock contention, but at the cost of greater resource consumption for row versioning. For more information on row versioning-based isolation levels, see: http://msdn.microsoft.com/en-us/library/ms191190.aspx
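Unlike the other levels, Read Committed Snapshot is enabled per database rather than per session. The following is a minimal sketch with an illustrative database name; switching the option requires near-exclusive access to the database, and the row versioning it introduces adds tempdb overhead.

-- Enable Read Committed Snapshot at the database level (database name is illustrative)
ALTER DATABASE [YourDataWarehouse] SET READ_COMMITTED_SNAPSHOT ON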



Note: An upper layer within a multi-tier database application may choose to override SQL Server’s default Read Committed transaction isolation level.

The lowest isolation level of Read Uncommitted is considered highly optimistic, ignoring all locks that other queries and transactions may have temporarily placed in the database. As the name suggests, this level may return data that has not been formally committed to the database. Uncommitted data could occur during logical database operations such as a row update or a physical operation such as a page split.

Accessing these database pages is sometimes referred to as dirty reads. However, it is worth noting that dirty reads should not necessarily be equated to bad data, since uncommitted pages have a reasonable chance of being swiftly committed to the database. Having said that, and because there are no guarantees when using Read Uncommitted, you should consider whether the query in question can weather a small degree of uncertainty.

The Read Uncommitted isolation level must be explicitly coded and can be indicated for a given database connection by using a SET command:

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED

Alternatively, you can embed a NOLOCK table hint within a query, as demonstrated below:

-- Script PT-10
SELECT COUNT(*)
FROM [snapshots].[trace_data] (NOLOCK)
WHERE ([Reads] > 50000)

As mentioned earlier, using Read Uncommitted or the NOLOCK hint in a query does carry a higher risk of returning inaccurate data than using the default level of Read Committed. This is especially true if there is a high volume of data modifications occurring while your query is running. On the other hand, if the tables in your data warehouse are mostly static during query execution or if there is an allowance for imperfections within some of your aggregate queries, Read Uncommitted may be a valid option to consider.

For more information about isolation levels, see Isolation Levels in the Database Engine.


Best Practices Summary


Here is a summary of this chapter’s best practices for querying, monitoring, and performance tuning in your data warehouse:

  • Pay attention to both the results of a query and the execution plan designed to return those results. Make it a routine exercise to scan query plans to become more familiar with how SQL Server resolves queries.

  • De-normalize your data warehouse schema where possible to reduce the need to join very large tables.

  • Use table partitioning for your largest tables, and be sure when querying that the optimizer uses partition elimination whenever possible.

  • Align secondary indexes with the same table partition function to reduce partitioned table maintenance.

  • Consider using a heap instead of a clustered index, especially with a table that partially mimics a clustered index via its partition function.

  • Design indexes to be covering for larger recurring queries whenever possible, especially when creating a filtered (i.e., partial) index.

  • Use an indexed view as an alternative to a denormalized table or to materialize an aggregated dataset. The more static the underlying data, the less overhead will be required to maintain the indexed views.

  • Use summary tables with automatic (scheduled) updates as an alternative to indexed views, especially when there is less significance in the currency of the data.

  • For query monitoring, focus on the quantitative measures of Disk I/O, CPU Time, and Query Duration.

  • Use the features of the Management Data Warehouse for overall server monitoring, and use the SQL Trace data collection set for query monitoring.

  • Focus your tuning efforts on SQL code that is repeatedly run, either as a stored procedure or a SQL script, and that consumes the most server resources. Data warehouse applications tend to place the greatest pressure on Disk I/O.

  • For tuning and management of large queries, consider saving historical copies of XML execution plans to disk for later reference.

Conclusion and Resources


Querying in a data warehouse introduces complexities beyond those typically encountered in a standard OLTP database. The sheer volume of data to be loaded, processed, analyzed, and managed requires greater understanding of some of the internals and capabilities of SQL Server. Any database server, regardless of size, deployed in a data warehouse environment runs the risk of getting bogged down and even becoming unresponsive without the benefit of these fundamental data warehouse management practices.

This chapter only begins to frame the discipline of data warehouse querying, monitoring, and performance tuning. With this foundation, however, you can dig deeper into these topics, deepen your understanding with further research and testing, and verify and share the best practices you discover as you deploy your SQL Server data warehouse applications.


Additional Resources


De-normalization Concepts: http://en.wikipedia.org/wiki/Denormalization

Partitioned Tables: http://technet.microsoft.com/en-us/library/dd578580(SQL.100).aspx

Partitioned Indexes: http://msdn.microsoft.com/en-us/library/ms187526.aspx

Partition Table Maintenance (Sliding Window): http://blogs.msdn.com/b/menzos/archive/2008/06/30/table-partitioning-sliding-window-case.aspx

SQL Server’s MAXDOP setting and partitioned tables: http://blogs.msdn.com/b/sqlcat/archive/2005/11/30/498415.aspx

Covering Indexes: http://www.simple-talk.com/sql/learn-sql-server/using-covering-indexes-to-improve-query-performance/

Filtered Indexes: http://www.sql-server-performance.com/articles/dba/Filtered_Indexes_in_SQL_Server_2008_p1.aspx

Indexed Views: http://msdn.microsoft.com/en-us/library/dd171921(SQL.100).aspx

Client side versus server side traces: http://sqlserverpedia.com/wiki/The_Server-side_Trace:_What,_Why,_and_How

The Management Data Warehouse: http://msdn.microsoft.com/en-us/library/bb677306.aspx

SQL Server Utility: http://msdn.microsoft.com/en-us/library/ee210548.aspx

Query Execution Plans: http://www.simple-talk.com/sql/performance/execution-plan-basics

Join Operators:


http://www.sql-server-performance.com/articles/dba/pysical_join_operators_p1.aspx
http://www.sql-server-performance.com/articles/dba/physical_join_operators_merge_p1.aspx
http://www.sql-server-performance.com/articles/dba/physical_join_operators_hash_p1.aspx

Isolation Levels: http://msdn.microsoft.com/en-us/library/ms189122.aspx

Row versioning based isolation levels: http://msdn.microsoft.com/en-us/library/ms191190.aspx

SQL 2008 Data Compression: http://msdn.microsoft.com/en-us/library/dd894051.aspx

