Thursday, 26 September 2013

Xen Host Crashes

XenServer host crashes and unresponsiveness are rare, and their root cause is difficult to find. However, the pointers below should help in locating and, in some cases, resolving the issue.

XenServer Unresponsiveness - The host becomes unresponsive during lean hours and does not come back online, and none of its services are reachable. Check the C-state settings of your processor. C-states are a power-saving feature that rests the processor's internal clock while it is idle. The problem with these states is that XenServer can go into such a deep sleep that its internal clock never resumes. Turbo Boost has also sometimes been seen as a culprit for the unresponsiveness. Both features can be disabled from the BIOS.

XenServer Crash - A XenServer crash can have many causes. Some of the more prominent ones are OOMKILL and kernel segfaults.

OOMKILL - Short for out-of-memory kill. This can happen when a service or module starts consuming all the memory available to the control domain of the Xen host. Whether OOMKILL is the reason the host stopped working can be ascertained from the kern logs. There is no single definitive solution, as the fix changes with the scenario; however, if possible, the host can be taken out of the pool and rebuilt, which saves production time.
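Checking the kern logs for OOMKILL can be sketched as a small Python filter. This is only an illustration: the function name find_oom_events is made up for this example, the match strings are wording the Linux kernel commonly logs around an OOM kill (the exact text varies between kernel versions), and /var/log/kern.log is an assumed log location.

```python
import re

# Wording the Linux kernel commonly logs around an OOM kill.
# Assumption: the exact text differs between kernel versions.
OOM_PATTERN = re.compile(
    r"invoked oom-killer|Out of memory: Kill(?:ed)? process",
    re.IGNORECASE,
)


def find_oom_events(lines):
    """Return the log lines that look like OOM-killer activity."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]


# Example use against the control domain's kernel log
# (the path is an assumption, adjust it for your host):
#
#     with open("/var/log/kern.log", errors="replace") as handle:
#         for hit in find_oom_events(handle):
#             print(hit)
```

If the filter returns nothing, that only means these particular strings were not found; it does not rule out memory pressure in the control domain.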

Kernel segfault - This happens when one or another module faults in the kernel. On a segmentation fault the kernel raises signal 11 (SIGSEGV), which is defined in the header file signal.h, and registers the event in the kern logs, where the corresponding entry can be found.

Again, the solution to this issue will vary, but the log entry gives some idea of what went wrong when the host crashed. It is always recommended to keep the firmware patched, and the host should have all Citrix hotfixes applied.
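To get that idea of what went wrong more quickly, the segfault entries can be pulled out of the kern logs with a short Python sketch. It assumes the fault was logged in the common Linux form "name[pid]: segfault at ADDRESS ip ... sp ... error N in MODULE[...]"; entries in any other layout will simply not match. The names SEGFAULT_RE and parse_segfaults are hypothetical and exist only for this example.

```python
import re

# Assumed layout: "name[pid]: segfault at ADDR ip ... sp ... error N in module[...]"
# The trailing " in module" part is optional because it is not always logged.
SEGFAULT_RE = re.compile(
    r"(?P<process>\S+)\[(?P<pid>\d+)\]: segfault at (?P<address>[0-9a-fA-F]+)"
    r".*?error (?P<error>[0-9a-fA-F]+)(?: in (?P<module>[^\[\s]+))?"
)


def parse_segfaults(lines):
    """Return one dict per segfault entry found in the given log lines."""
    events = []
    for line in lines:
        match = SEGFAULT_RE.search(line)
        if match:
            event = match.groupdict()
            event["pid"] = int(event["pid"])  # pid as a number, rest stay strings
            events.append(event)
    return events


# Example use (the log path is an assumption, adjust it for your host):
#
#     with open("/var/log/kern.log", errors="replace") as handle:
#         for event in parse_segfaults(handle):
#             print(event["process"], event["pid"], event["module"])
```

The process and module fields are usually the most useful starting point when searching for a matching hotfix or firmware update.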

Kernel Panic - Sometimes after a crash, XenServer will not come up and will show a kernel panic on the console. A kernel panic is an action taken by the host's kernel to state that it received a fatal error while booting and is not able to recover from it. It can be caused by hardware failure as well as by file system corruption.