A previous post on IT planning for disaster recovery described how a large international organization had centralized its IT services and needed to make them more resilient. This post will tell you some of what it did inside the data center.
The first step for this organization was physical redundancy: making sure that the physical resources in its headquarters data center were duplicated so that systems would keep running even after a hardware failure. That meant providing two Internet connections, each with enough capacity to handle all of the organization’s traffic on its own. It meant that all the information storage had to be redundant, using a storage area network to efficiently provide a pool of fast, backed-up storage to the data center. And enough servers had to be configured so that key services could run even if some of the servers failed.
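The redundancy rule in this paragraph can be written as a simple check: the resources that remain after one unit fails must still cover the peak load. The sketch below is only an illustration. The function name and the capacity figures are hypothetical and do not come from the organization’s actual planning.

```python
def survives_single_failure(capacities, peak_load):
    """Return True if peak_load can still be carried after the largest unit fails.

    capacities: capacity of each redundant unit (for example, link speed in Mbps).
    peak_load:  the highest load the units must carry, in the same unit of measure.
    """
    if len(capacities) < 2:
        # A single unit has nothing to fall back on.
        return False
    # Worst case: the largest unit is the one that fails.
    return sum(capacities) - max(capacities) >= peak_load


# Hypothetical figures: two 100 Mbps links and a peak of 80 Mbps.
print(survives_single_failure([100, 100], 80))
# Hypothetical figures: the second link is too small to carry the peak alone.
print(survives_single_failure([100, 50], 80))
```

The same check applies to servers: list the capacity of each server in a group and compare what is left after a failure against the load of the key services.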
The next step was to tie these systems together in a way that would keep them running. Software had to be installed to run multiple copies of key software systems and to route around failures when they occurred. The general term for this is clustering. When a cluster of separate servers provides a service like email or an intranet website, the service stays up even when one of the members of the cluster goes down. This is useful not only for equipment failures but also for system maintenance: if one cluster member needs hardware maintenance or a software upgrade, it can be temporarily removed from the cluster without disrupting users. Since our customer had users in many time zones around the world, keeping the services up 24/7 required some kind of clustering.
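To make the idea concrete, here is a minimal sketch of how a cluster can route requests around a member that has failed or been taken out for maintenance. It is a toy model, not the clustering software this organization used. The class, its methods, and the server names are all hypothetical.

```python
class Cluster:
    """Toy model of a service cluster that routes around unavailable members."""

    def __init__(self, members):
        self.members = list(members)        # fixed membership order
        self.available = set(self.members)  # members currently able to serve
        self._next = 0                      # round-robin position

    def take_offline(self, member):
        """Mark a member unavailable (failure or planned maintenance)."""
        self.available.discard(member)

    def bring_online(self, member):
        """Return a known member to service."""
        if member in self.members:
            self.available.add(member)

    def route(self):
        """Return the next available member in round-robin order."""
        for _ in range(len(self.members)):
            candidate = self.members[self._next]
            self._next = (self._next + 1) % len(self.members)
            if candidate in self.available:
                return candidate
        raise RuntimeError("no cluster members available")


# Hypothetical server names.
cluster = Cluster(["mail-1", "mail-2", "mail-3"])
cluster.take_offline("mail-2")  # e.g. a software upgrade on one member
print([cluster.route() for _ in range(4)])
# prints ['mail-1', 'mail-3', 'mail-1', 'mail-3']
```

Requests keep flowing to the remaining members while one is offline, and the member rejoins the rotation as soon as it is brought back online. The service only stops if every member is unavailable at once.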
Choices for Microsoft Shops
Since this organization ran mostly Microsoft software, it employed a number of Microsoft-based clustering solutions. For example, Active Directory is designed to run on any number of physical servers. It’s very common for organizations to run two, three or more Windows servers with Active Directory to make sure that their basic network services are provided in a resilient fashion. Similarly, Microsoft SQL Server has a method for duplicating data between multiple SQL servers in the same location, so that if one server goes down the others can carry on without interruption. For Microsoft Exchange, there are several ways to use multiple machines for resilience. The one chosen by this organization was Cluster Continuous Replication (CCR), a feature of Exchange Server 2007. This organization has about 2,000 mailboxes in its Microsoft Exchange system, serving people all over the world, so at certain times of the day it can carry a very heavy load. CCR pairs an active server with a passive server that keeps a continuously updated copy of the mailbox data, and each server uses its own storage. If the active server goes down, the passive server takes over, so the service does not stop.
The duplication of key equipment and the clustering of servers took time to fully implement, but the result was resilient and reliable IT services for users. Once these systems were in place, the headquarters could take on loads that had previously been handled in the organization’s field offices, with confidence that service would improve. For example, remote Exchange servers in certain field offices were shut down and their mailboxes transferred to headquarters. The benefits were immediate: the Exchange system was more easily kept up to date and online 24/7; access to email became more uniform, whether in or out of the office; and the total cost to the organization was significantly reduced.
Once the upgraded data center and its services were operating, only one worry remained: What would happen if the entire data center went down? It wasn’t likely, but a fire, flood or other natural disaster could shut down the headquarters. So the organization sought a way to make the entire data center resilient. Could it duplicate its IT services in another location so that even in the worst-case scenario, IT services for the organization would continue? The answer to that question will be in our next blog post.