There is a lot of excitement around Microsoft virtualization technologies these days, and rightfully so. One of the ‘hottest’ areas right now is making virtual machines highly available using Windows Server 2008 R2 Failover Clusters so end users can take maximum advantage of Live Migration and Cluster Shared Volumes (CSV). This configuration not only saves a lot of money but also provides business continuity in the event of an unforeseen failure in the environment.
While I could spend time extolling the virtues of our virtualization technologies, I am really here to discuss what can happen if one gets too ‘overzealous’ and does not use common sense and a sound plan to implement the solution correctly. As with many of the posts you read here on the Rustyice blog, this one was written because of experiences we have had with our customers.
So, what happens when a customer decides they love Microsoft virtualization and high availability technologies so much that they want to virtualize their entire infrastructure? And suppose they want to be sure it is highly available, so they create a multi-node Failover Cluster to host the virtual machines. When the project is complete, the customer is very proud of what they have accomplished: now they can retire their old hardware and save a great deal on power and cooling costs in the datacenter. Everyone is happy and celebrations abound. And then it happens…someone decides, for whatever reason, to shut down the cluster(s). After a while, when they decide it is OK to bring the cluster(s) back online…they cannot. Oh, and one more thing: the clusters are running on Windows Server 2008 R2 Server Core. Trust me, this is a true story and has already happened more than once, hence the impetus behind this post.
If the predicament is not immediately obvious (and it should be for cluster veterans), I will tell you: the cluster service fails to start because it cannot contact a Domain Controller anywhere in Active Directory. And this is because all of the Domain Controllers and DNS servers (critical infrastructure servers) have been virtualized and are, in fact, virtual machines hosted by the very cluster that is trying to start up. Clearly, this is a case of having all of one's eggs in one basket – not good.
How did we fix this? It was not a quick fix. In a nutshell, the Support Engineer had the customer determine which storage LUN was hosting the VM files for one of the virtualized Domain Controller/DNS servers. That LUN was then mapped to a standalone server so the VHD file could be copied off to another standalone Hyper-V server, where a new VM could be created and placed in service. Once a Domain Controller was back online, the cluster could be started.
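A minimal sketch of that recovery sequence is below. This is illustrative only: the disk number, drive letters, server names, and paths are all assumptions, and on Server 2008 R2 the VM creation step itself would be done in Hyper-V Manager.

```
:: On a standalone (non-clustered) server that can see the shared storage.
:: Disk number, paths, and server names below are hypothetical.

:: 1. Bring the LUN online and make it writable (diskpart is interactive):
diskpart
   DISKPART> list disk
   DISKPART> select disk 3
   DISKPART> attributes disk clear readonly
   DISKPART> online disk

:: 2. Copy the Domain Controller's VHD off to a standalone Hyper-V host:
robocopy E:\VMs\DC01 \\HV-STANDALONE\D$\Recovery\DC01 DC01.vhd

:: 3. On the standalone Hyper-V host, create a new VM attached to the copied
::    VHD, start it, and verify that AD and DNS are responding.

:: 4. With a Domain Controller reachable again, start the cluster service
::    on a cluster node:
net start clussvc
```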
How can this type of scenario be avoided?
1. Develop a solid, well-thought-out migration plan. Ensure the planning team includes people who understand how all the technologies function in a virtualized environment.
2. Have at least one physical Domain Controller/DNS server available in the environment.
3. If #2 is not an option, distribute the virtualized infrastructure servers across multiple Hyper-V clusters to reduce the chance they will all be offline at the same time.
4. Plan to have one or more Hyper-V servers running in a WORKGROUP configuration; Hyper-V servers do not have to be joined to an Active Directory domain. Then distribute some of the virtualized infrastructure servers across these servers.
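For option 4, a workgroup Hyper-V host can still be managed remotely from a domain-joined workstation by caching credentials for it. A sketch, where the host name and account are hypothetical:

```
:: On the management workstation, cache credentials for the workgroup host
:: so Hyper-V Manager can connect (you will be prompted for the password):
cmdkey /add:HV-WORKGROUP01 /user:HV-WORKGROUP01\Administrator /pass

:: On the workgroup Server Core host itself, enable remote management and
:: the matching firewall exceptions via the built-in configuration menu:
sconfig
```

Because this host has no dependency on Active Directory, it can boot and start its guest Domain Controllers even when every domain-joined cluster is down.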