Summary:
Attempt to reboot a node by Oracle Clusterware may result in OS freeze instead of reboot.
Symptoms:
Under certain failure conditions (such as loss of network or storage connectivity) Oracle Clusterware performs node fencing by rebooting the node. The node may hang instead of rebooting. Manual reset of the server/VM is needed if the node hangs.
Affected products:
Any Oracle RAC or Failover HA cluster
Affected versions:
- RHEL 8 based clusters
- OL 8 based clusters running RHCK kernel
Note: RHEL/OL 7 based clusters are NOT affected
Root cause:
Prior to GI 19.13 (Oct 2021) Release Update node fencing with reboot was done by sending SysRq reboot command to kernel. In GI 19.13 (Oct 2021) Release Update node fencing is done by sending SysRq crash command to kernel. Also, the crash command is sent from two Clusterware processes in parallel. Latest RHEL 8 kernel may hang when it receives two SysRq crash commands simultaneously.
Resolution:
Apply GI 19.14 (Jan 2022) Release Update instead of 19.13.
Alternatively, you can use one of the following workarounds:
- Apply GI one-off patch 33563477 on top of GI 19.13 (October 2021) Release Update.
- On Oracle Linux 8 switch to UEKR6 kernel.
Azure customers, please also see RH 8.5 kernel on Azure may hang during node fencing instead of rebooting