Case Study: The Importance of IT Resilience, Part III

This post continues our case study of IT planning for disaster recovery at a worldwide NGO. Here, in the final post, we discuss off-site replication.
The organization’s build-out followed a logical sequence. Its first priority was to build a true 24/7 operation that could grow to support the entire institution. That included reliable and flexible system architectures at headquarters, as we discussed in the previous post. Only after 24/7 service was firmly established was the organization in a position to support off-site disaster recovery and business continuity economically.

Replication technologies

All headquarters assets were in a single location, so it was a priority to develop some form of offsite facility that could provide disaster recovery and business continuity. CGNET was chosen to provide a custom solution that combined several software tools for maximum effect.
CGNET currently operates seven servers in its recovery data center for the organization. These seven servers are implemented as virtual machines on a single large physical server, which has two CPUs, 64 gigabytes of RAM, and enough directly attached storage to hold all of the organization’s virtual machines. This setup is not clustered or internally resilient, since it is intended only for disaster situations, which are infrequent and short in duration. The virtual machine setup provides an economical solution for disaster recovery. The CGNET DR architecture was reviewed by the external auditors of the organization’s IT department and approved for implementation.

Implementation

Implementation of the solution began in the first quarter of 2010. The first service to be established was Windows domain services. A Windows domain controller was brought up and joined to the headquarters domain. No additional software was needed, since Windows automatically keeps domain controllers synchronized.
The next most important servers to implement were the SQL servers. One of these servers was for general-purpose use and was backed by a cluster at the organization’s headquarters. The second server provided ERP support, and was a single server both at headquarters and in the disaster recovery site. CGNET added the CA XOSoft software package to synchronize SQL data in near real time: changes are replicated continuously from headquarters to the DR site. Preserving the contents of these databases in the event of a disaster is critically important for many of the organization’s IT services.
Next came the SharePoint servers: a SharePoint application server and a SharePoint front-end server. Again XOSoft was used to keep the contents of the servers synchronized between the data centers. Only one front-end server and one application server were included in the disaster recovery site, while there were two for on-site resilience at headquarters. It was judged that one server pair would have sufficient capacity to maintain services in the event of disaster.
The final part of the disaster recovery implementation concerned Microsoft Exchange. The organization runs Exchange 2007 Service Pack 2 with a CAS server and a clustered configuration for the mailbox server. The Exchange CAS server can be replicated by Exchange without any third-party product, simply by adopting the CAS role when joining the Exchange site. The mailbox server needs additional software for data replication. The mailbox cluster uses Microsoft’s Cluster Continuous Replication (CCR), which is not supported by XOSoft at this time. (XOSoft does support “single copy cluster” as well as non-clustered Exchange mailbox servers.) Consequently, a different package, Double Take, was chosen to replicate the Exchange mailbox server.
The SharePoint and Exchange services were implemented with a manual failover process. The XOSoft software provides tools that perform most of the detailed configuration changes required when the replica services in the DR site take over from the headquarters servers. While XOSoft could be programmed to fail over automatically when the production site goes down, it was judged safer to require a staff member to approve the failover. In addition, a few small steps, such as external DNS changes, are also done manually in the event of failover. Restoration of service to headquarters after failover is done in a similar fashion, using the XOSoft tools.
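The approval step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how XOSoft implements it: the `FailoverGate` class, the health probe, and the three-failure threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FailoverGate:
    """Allow failover only after repeated failed checks AND a named operator's approval."""

    check_primary: Callable[[], bool]  # hypothetical health probe; True means healthy
    failures_required: int = 3         # assumed threshold, not taken from the case study
    consecutive_failures: int = 0
    log: List[str] = field(default_factory=list)

    def poll(self) -> bool:
        """Run one health check; return True once failover should be proposed."""
        if self.check_primary():
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failures_required

    def fail_over(self, approved_by: str, run_failover: Callable[[], None]) -> bool:
        """Run the failover action only if the outage is confirmed and an operator signs off."""
        if self.consecutive_failures < self.failures_required:
            self.log.append("refused: primary not confirmed down")
            return False
        if not approved_by.strip():
            self.log.append("refused: no operator approval")
            return False
        run_failover()
        self.log.append(f"failover approved by {approved_by}")
        return True
```

The monitoring loop can detect the outage on its own, but nothing is switched until a person supplies a name to `fail_over`. That mirrors the choice made here: detection is automated, the decision is not.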
A notable feature of XOSoft is its ability to test a replica without completely failing over to it. This “Assured Recovery” provides peace of mind in knowing that the replicated data and the replicated software are fully functioning in the DR site, without having to disrupt normal operations in the primary site. This capability is not available in most other DR packages.
The Exchange replication has been challenging and is still in progress. It is a very large implementation involving more than 500 gigabytes of user data. Moving that much data over wide area links to begin replication has proved impractical, so CGNET will receive a backup copy of the message store on an external disk drive shipped from headquarters, with ongoing updates then applied to that baseline copy. Double Take has a good reputation for this sort of replication, but it lacks the Assured Recovery procedure of XOSoft. Consequently, testing the system will require more time spent failing the system over and back as part of the tests.
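A rough calculation shows why shipping a drive beats seeding over the wire. The case study does not state the link speed, so the 10 Mbps link and the 50% usable share in the sketch below are purely illustrative assumptions, and `seed_transfer_days` is a hypothetical helper.

```python
def seed_transfer_days(size_gb: float, link_mbps: float, usable_fraction: float = 0.5) -> float:
    """Estimate the days needed to copy size_gb over a link rated at link_mbps.

    usable_fraction is the share of the link that replication may consume;
    the default of 0.5 is an assumption, not a figure from the case study.
    """
    if size_gb < 0 or link_mbps <= 0 or not 0 < usable_fraction <= 1:
        raise ValueError("size must be >= 0, link speed > 0, usable fraction in (0, 1]")
    size_megabits = size_gb * 1000 * 8  # decimal gigabytes to megabits
    seconds = size_megabits / (link_mbps * usable_fraction)
    return seconds / 86_400             # seconds per day


# 500 GB over an assumed 10 Mbps link, half of it available for seeding
print(round(seed_transfer_days(500, 10), 2))  # 9.26
```

Under those assumptions the initial copy would tie up half the link for more than nine days, before counting any changes made to the mailboxes in the meantime.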

Future steps

There are still some loose ends left in the system for business recovery.
The organization believes that it can maintain nearly continuous operation of SharePoint and Exchange, and can recover its SQL-based applications. Operation of the SQL-based applications will not be continuous, because the software that provides end-user services based on these databases has not been replicated onto live servers at the DR site. Some downtime is considered tolerable for these applications in the event of disaster, however.
Ongoing software updates are sure to add work in the year ahead. The ERP system may be migrated from the current SUN system to Microsoft Dynamics, for example. An upgrade to SharePoint 2010 will introduce complications. And the ongoing volume of replication transfers must be monitored, to be sure that the links and storage always have sufficient capacity in the event of a disaster. For all these reasons DR and business recovery represent an ongoing commitment on the part of any IT department. The organization studied here has made an interesting choice regarding which services deserve more attention and resources for business continuity.
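The capacity monitoring mentioned above can start as a very simple check. The sketch below is one possible shape for it, with hypothetical inputs: the daily change volume, link speed, storage figures, and the 80% warning threshold are all assumptions for illustration, not measurements from this deployment.

```python
from typing import List


def capacity_warnings(
    daily_change_gb: float,
    link_mbps: float,
    storage_used_gb: float,
    storage_total_gb: float,
    threshold: float = 0.8,  # assumed warning level
) -> List[str]:
    """Return the resources ("link", "storage") whose utilization exceeds threshold."""
    if link_mbps <= 0 or storage_total_gb <= 0:
        raise ValueError("link speed and total storage must be positive")
    warnings = []
    # Data the link can carry in one day, in decimal gigabytes.
    link_capacity_gb = link_mbps * 86_400 / 8 / 1000
    if daily_change_gb / link_capacity_gb > threshold:
        warnings.append("link")
    if storage_used_gb / storage_total_gb > threshold:
        warnings.append("storage")
    return warnings


# Assumed figures: 20 GB of daily changes, 10 Mbps link, 900 GB used of 1000 GB
print(capacity_warnings(20, 10, 900, 1000))  # ['storage']
```

Run daily against real figures from the replication tools, a check like this flags shrinking headroom well before a disaster exposes it.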