ESXi Fails with “Corruption in dlmalloc” on HPE Server
Why Does ESXi Fail with “Corruption in dlmalloc” on HPE Servers?
The “Corruption in dlmalloc” issue occurs because multiple esxcfg-dumppart threads attempt to free memory that was used for configuring the dump partition. Thread A checks whether there are entries to be freed and proceeds to free them while, within the same time frame, Thread B attempts to free the same entries.
According to VMware KB2147888, this issue is resolved in ESXi 6.0 U3. So why does it still happen on ESXi 6.0 U3 or ESXi 6.5 U1 when they are installed on HPE ProLiant servers?
What’s My Story with “Corruption in dlmalloc”?
We have some monster ESXi hosts in our environment; they have Intel NICs (HPE OEM) and Broadcom NICs (HPE OEM) for 1Gb or 10Gb connections. We had a strange issue on those hosts: virtual machine traffic was blocked during virtual machine migration between hosts. When I investigated more deeply, I saw that the virtual machine’s traffic was blocked and the NIC appeared to have no VLAN configured (screenshot below). I was sure about the network configuration, so I planned to upgrade the Intel NIC driver on all hosts because of this entry in the release notes of the new version (1.7.10):
- Fix for the NIC down procedure hanging when heavy traffic is running.
I should mention that there was no new driver on the HPE Support Center, so I had to download the new version from VMware.
Everything was fine until yesterday, when one of the ESXi hosts crashed and the purple screen below appeared:
I also found the following entries in /var/log/vmkernel.log:
OCD_Agent[14573719]: intelcim:Intelcim_DeviceSensorProviderInit: initialize the Provider.
OCD_Agent[14573719]: find_loaded_drivers: Unable connect to driver igbn
OCD_Agent[14573719]: No Intel OCS devices found. Aborting.
OCD_Agent[14573719]: ocs_init failed, no OCS polling
OCD_Agent[14573719]: No Intel OCB devices found. Aborting.
OCD_Agent[14573719]: ocb_init failed, no ocb polling
OCD_Agent[14573719]: No devices to poll
The “sfcbd-watchdog” service was also stopping and starting frequently.
After a simple search, I found the solution for the issue on HPE Support.
What’s the Solution or Workaround?
Here is the title of HPE advisory document:
Advisory: VMware – Systems Running the HPE Customized Image for VMware ESXi 6.0 U3 or VMware ESXi 6.5 U1 May Encounter A Failure in The sfcb-intelcim process “sfcb-intelcim: wantCoreDump:sfcb-intelcim signal:6 or 11 exitCode:0 coredump:enabled”
On any HPE ProLiant server running the HPE customized image for VMware ESXi 6.0 U3 or VMware ESXi 6.5 U1, the sfcb-intelcim process may fail on the VMware ESXi host, creating a zdump file in /var/core/. This occurs because the Intel Common Information Model (CIM) provider is not compatible with native VMware drivers.
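To check whether a host shows the symptoms described in the advisory, something like the following sketch may help. It assumes that the dump file name contains the process name (sfcb-intelcim) and that the message quoted in the advisory title is written to a log file you pass in; the function names and the CORE_DIR variable are only illustrative.

```shell
#!/bin/sh
# Dump location named in the advisory; can be overridden for testing.
CORE_DIR="${CORE_DIR:-/var/core}"

# Print the names of dump files that mention the sfcb-intelcim process.
# Assumption: the dump file name contains the process name.
list_intelcim_dumps() {
    ls "$CORE_DIR" 2>/dev/null | grep 'sfcb-intelcim'
}

# Print log lines that carry the signature quoted in the advisory title.
# $1 is the log file to search (which log holds the message is an assumption).
find_intelcim_crashes() {
    grep 'wantCoreDump:sfcb-intelcim' "$1"
}
```

After loading these functions in the ESXi shell, run list_intelcim_dumps, then find_intelcim_crashes with a log file such as /var/log/vmkernel.log. No output from either suggests this particular failure has not occurred on that host.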
As I mentioned before, I had to use NIC driver version 1.7.10 to resolve the original issue; HPE had not released a newer version, and its last published version was the same as the one already installed.
Based on the solution in the HPE advisory document, the Intel CIM provider should be removed:
/etc/init.d/sfcbd-watchdog stop
esxcli software vib remove -n=intelcim-provider
/etc/init.d/sfcbd-watchdog start
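If you need to run this on several hosts, a guarded version of the same steps can avoid stopping the watchdog on hosts where the provider is not installed. This is only a sketch: the function names are hypothetical, and it assumes the VIB name appears in the output of esxcli software vib list exactly as given in the advisory steps.

```shell
#!/bin/sh
# VIB name as used in the removal command from the advisory.
VIB_NAME="intelcim-provider"

# Succeeds if the VIB list read from stdin mentions the given name.
vib_listed() {
    grep -q -- "$1"
}

# Stop the watchdog, remove the provider, and start the watchdog again,
# but only when the VIB is actually present on this host.
remove_intelcim() {
    if esxcli software vib list | vib_listed "$VIB_NAME"; then
        /etc/init.d/sfcbd-watchdog stop
        esxcli software vib remove -n "$VIB_NAME"
        /etc/init.d/sfcbd-watchdog start
    else
        echo "$VIB_NAME is not installed; nothing to do."
    fi
}
```

Calling remove_intelcim in the ESXi shell then performs the removal, or prints a message and leaves the services untouched.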
Also, the Intel CIM provider will no longer be included in the HPE customized images for VMware ESXi 6.0, VMware ESXi 6.5 and VMware ESXi 6.7; a fixed version needs to be installed manually when it becomes available from hpe.com.
Here is the advisory document ID: a00048925en_us
Read the advisory document for more information.