Symptoms
All cluster nodes are up, but CRS is not able to start on one or more nodes.
Solution
- Run the flashgrid-cluster command to confirm that all nodes are up and no nodes are shown as Inaccessible.
- Kill all "ohasd.bin reboot" processes on the database nodes where CRS fails to start:
    ps -ef | grep "ohasd.bin reboot" | grep -v grep | awk '{print $2}' | xargs kill -9
- Ensure that the ohasd.service and oracle-ohasd.service services are running:
    systemctl is-active ohasd.service || systemctl start ohasd.service
    systemctl is-active oracle-ohasd.service || systemctl start oracle-ohasd.service
  If one of these two services was inactive, it should start now and should also start CRS automatically if autostart is enabled. In that case, the next step is expected to show the following message:
    CRS-4640: Oracle High Availability Services is already active
- Start CRS:
    crsctl start crs -wait
- If the previous step shows
    CRS-4640: Oracle High Availability Services is already active
  then wait for several minutes and check the status:
    crsctl check crs
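The final "wait for several minutes and check the status" step can be scripted as a polling loop. The following is a minimal sketch, not part of the official procedure. It assumes that a healthy stack makes crsctl check crs print only lines containing "is online". The function name wait_for_crs and the variables CRS_POLL_ATTEMPTS and CRS_POLL_INTERVAL are hypothetical names introduced for illustration.

```shell
# Sketch: poll "crsctl check crs" until every reported component is online.
# Assumption: a healthy stack prints only lines containing "is online".
# wait_for_crs, CRS_POLL_ATTEMPTS and CRS_POLL_INTERVAL are hypothetical names.
wait_for_crs() {
    attempts="${CRS_POLL_ATTEMPTS:-20}"   # number of checks before giving up
    interval="${CRS_POLL_INTERVAL:-30}"   # seconds to sleep between checks
    i=1
    while [ "$i" -le "$attempts" ]; do
        # Capture the status text; ignore the exit code and judge by the output.
        status="$(crsctl check crs 2>&1)" || true
        # Ready when there is output and no line lacks "is online".
        if [ -n "$status" ] && ! printf '%s\n' "$status" | grep -qv 'is online'; then
            echo "CRS is online (check $i of $attempts)"
            return 0
        fi
        echo "CRS not ready yet (check $i of $attempts)"
        i=$((i + 1))
        # Sleep only if another check will follow.
        if [ "$i" -le "$attempts" ]; then
            sleep "$interval"
        fi
    done
    echo "CRS still not online after $attempts checks" >&2
    return 1
}
```

With the defaults, the loop checks for roughly ten minutes before giving up. If it returns non-zero, review the crsctl check crs output manually instead of retrying blindly.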
Cause of the problem
CRS may get stuck in a failed state, unable to open OCR, if one or more disks containing voting files are offline while CRS is starting. This is likely to happen in one of the following scenarios:
- Manual start of CRS after the entire cluster was down and while one of the other nodes is still down.
- While CRS is starting on one of the nodes, another node is rebooted or stopped.
How to prevent the problem
- Do not start CRS manually while any other node is down. If CRS autostart is enabled then CRS will start automatically after all nodes are up.
- Always use the flashgrid-node reboot command for rebooting a node.
- Do not reboot more than one node at a time. You can reboot a node only after all other nodes are up, CRS is running, all disks are online, and there is no active Resync.
- If you need to reboot the entire cluster, do not reboot the nodes simultaneously. Instead, stop all nodes and then start all nodes.
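Part of the pre-reboot check can be automated. The sketch below covers only the first condition (no node shown as Inaccessible) and assumes that the flashgrid-cluster output contains the literal word "Inaccessible" for such nodes. CRS status, disk status, and Resync activity still need to be verified separately. The function name reboot_preflight is a hypothetical helper, not a FlashGrid command.

```shell
# Sketch: refuse to proceed with a reboot while any node is Inaccessible.
# Assumption: flashgrid-cluster prints the word "Inaccessible" for such nodes.
# reboot_preflight is a hypothetical helper name, not a FlashGrid command.
reboot_preflight() {
    # Fail early if the flashgrid-cluster command is not available.
    if ! command -v flashgrid-cluster >/dev/null 2>&1; then
        echo "flashgrid-cluster not found; cannot verify node status" >&2
        return 2
    fi
    report="$(flashgrid-cluster 2>&1)" || true
    if [ -z "$report" ]; then
        echo "No output from flashgrid-cluster; cannot verify node status" >&2
        return 2
    fi
    if printf '%s\n' "$report" | grep -q 'Inaccessible'; then
        echo "At least one node is Inaccessible; do not reboot yet" >&2
        return 1
    fi
    echo "No Inaccessible nodes reported"
    return 0
}
```

A return code of 0 means only that no node was reported as Inaccessible. It does not by itself confirm that a reboot is safe.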