The data architect is typically the responsibility of the team that selects the platform architecture for a data warehouse. The final selection of a hardware and software platform depends on a combination of many factors, including performance and adherence to corporate standards. This section provides a brief overview of this topic; however, a detailed discussion of platform architecture is out of scope for this document.
Data warehouse data volumes and usage requirements require large dedicated hardware servers and storage hardware. Historically, organizations deployed SQL Server data warehouses by getting a big server with lots of CPUs and lots of memory and allocating lots of space in the Storage Area Network (SAN) for the data warehouse databases. However, alternative options have emerged over the past few years, as shown in Figure 2-34.
This illustration shows the following server options:
Server – The historical Symmetrical Multiprocessor (SMP) configuration is a multi-processor architecture with lots of memory running SQL Server Enterprise Edition.
Server virtualization – This configuration maximizes server resources by supporting the running of multiple virtual instances of SQL Server. This is a potential configuration for pre-production data warehouse environments, including test, staging, and QA.
Data warehouse appliance – SQL Server 2008 R2 PDW is a massively parallel processor (MPP) architecture supporting both centralized EDW and hub-and-spoke architectures on dedicated SAN storage.
Server/reference architecture – Microsoft’s Fast Track Data Warehouse architecture is a reference hardware configuration tailored to run a SQL Server data warehouse workload.
We’ll look briefly at each of these options, but the selection, acquisition, and deployment of data warehouse platform architectures is out of scope for this document.
Data Warehouse Server
The traditional SQL Server data warehouse server for medium-size data volumes is a 64-bit machine running SQL Server Enterprise Edition and configured with a lot of CPU and memory. Given the memory- and compute-intensive workloads for both querying and populating the data warehouse, organizations should purchase hardware with the maximum CPUs and memory configuration they can afford.
These data warehouse servers are typically connected to the corporate SAN. This shared SAN provides centralized administration and is more flexible than storage directly attached to a server. This simplifies administration operations such as backup and disaster recovery. However, the SAN is the primary bottleneck for most data warehouse workloads experiencing performance issues, so pay careful attention to SAN configurations.
One issue with this large-server approach is that it becomes very expensive because multiple servers are required for different phases of the data warehouse lifecycle, including development, test, QA, and production. Traditionally, organizations typically purchased smaller servers for all servers not in production, but virtualization has provided another option.
Server virtualization has emerged as a dominant theme in computing over the past 5 years. Virtualization reduces the number of servers in data centers, which results in lower IT costs from acquisition through ongoing management. It also provides a flexible mechanism for spinning up and tearing down server environments.
This flexibility is useful for non-production environments, but dedicated servers are still the norm for production data warehouse environments. SQL Server 2008 R2 Datacenter Edition is a new product offering from Microsoft for organizations looking for a virtualized SQL Server solution.
SQL Server Fast Track Data Warehouse
The challenges involved in correctly acquiring and configuring a data warehouse hardware and software platform led Microsoft to introduce the SQL Server Fast Track Data Warehouse. This is a “scale-up” reference architecture targeted for data warehouses containing up to 48 TB of data. Options include different reference hardware configurations from HP, Dell, Bull, EMC, and IBM.
The benefits to this solution are that preconfigured, industry-standard hardware provides lower cost of ownership through better price/performance, rapid deployment, and correct configurations.
For details about SQL Server Fast Track Data Warehouse, see these links:
SQL Server 2008 R2 PDW is a highly scalable data warehouse appliance that uses MPP software architecture based on the DataAllegro acquisition. Its target workload is data warehouses containing up to 400 TB of data.
In a traditional, symmetric multi-processing (SMP) architecture, query processing occurs entirely within one physical instance of a database. CPU, memory, and storage impose physical limits on speed and scale.
A PDW MPP appliance partitions large tables across multiple physical nodes, each node having dedicated CPU, memory, and storage and each running its own instance of SQL Server in a parallel, shared-nothing design. All components are balanced against each other to reduce performance bottlenecks, and all server and storage components are mirrored for enterprise-class redundancy.
PDW is a distributed architecture that can act both as the centralized EDW and the hub in a hub-and-spoke architecture, as covered earlier.
You can read about PDW at these links:
SQL Server 2008 R2 Parallel Data Warehouse Web site
Large 64-bit servers running SQL Server Enterprise Edition can support multi-terabyte data warehouses and are a solid choice for organizations today. Here is some guidance for considering one of the alternative options:
Non-production environments – SQL Server 2008 R2 Datacenter Edition provides the flexibility of quickly spinning up and tearing down environments, which lowers the total cost of ownership (TCO) for non-production data warehouse environments.
Performance issues – Performance issues, both directly experienced and projected, can be addressed by scaling up with SQL Server Fast Track Data Warehouse. Pre-configured solutions help minimize common performance problems, such as SAN bottlenecks due to incorrect configurations.
Enterprise data warehouse – PDW provides a platform for the largest organizations’ EDW needs. Its hub-and-spoke architecture supports data marts optimized for target business consumer communities and simplifies data integration between the data warehouse and downstream data marts.
These options allow organizations to standardize their data warehouse on the SQL Server platform—whether the data warehouse is measured in gigabytes or terabytes—first by selecting a platform and then by choosing more scalable platforms as their data warehouse matures.
Now that we’ve briefly covered platform architecture, the next section addresses the data warehouse database architectures implemented on that platform.