Symptoms
One or more disks were taken offline by ASM
Solution
- Bring the disk(s) online as soon as possible to restore full redundancy (unless the problem recurs after the disks are brought online):
  - to online all disks attached to a particular node, run as user fg@ on that node:
    flashgrid-node online
  - to online a particular disk (except quorum disks), run as user grid@:
    asmcmd online -G DGNAME -D 'HOSTNAME$DISKNAME'
  - to online a particular quorum disk (a disk on a quorum node), run as user grid@:
    asmcmd online -G DGNAME -q -D 'HOSTNAME$DISKNAME'
  - to online all disks in a disk group, run as user grid@:
    asmcmd online -G DGNAME -a
  - Note: the asmcmd command can be executed on any database node; it is not available on quorum or storage nodes.
- Wait for the disk group resync to complete, and confirm that the disk(s) stay online by running:
    flashgrid-dg
- Determine the cause of the problem to prevent it from happening again:
  - Check /opt/flashgrid-diags/logs/flashgrid-node-monitor-all.log on the node for clues about why the disk went offline.
  - Use the table below to determine the likely cause of the disk(s) going offline.
  - Upload cluster diags to FlashGrid support and open a FlashGrid support ticket with a description of the problem.
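When typing or scripting the single-disk commands, note that the ASM disk name contains a literal `$`. The single quotes around 'HOSTNAME$DISKNAME' keep the shell from expanding $DISKNAME as a variable. The following is a minimal bash sketch of a hypothetical helper, `asm_online_disk_cmd` (not part of FlashGrid or Oracle tooling), that builds the command with the `$` kept literal and only prints it, so you can review it before running it as user grid@. The optional fourth argument `quorum` adds the -q option used above for quorum disks.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of FlashGrid or Oracle tooling): print the
# asmcmd command that onlines one disk, without running it (dry run).
#   $1 = disk group name, $2 = host name, $3 = disk name,
#   $4 = optional "quorum" to add -q for a disk on a quorum node
asm_online_disk_cmd() {
  local dg="$1" host="$2" disk="$3" kind="${4:-}"
  # The ASM disk name is HOSTNAME$DISKNAME. The '$' is single-quoted so the
  # shell treats it literally instead of expanding a variable.
  local asm_disk="${host}"'$'"${disk}"
  local q=""
  [ "$kind" = "quorum" ] && q="-q "
  # Print the disk name in single quotes so the output can be pasted as-is.
  printf "asmcmd online -G %s %s-D '%s'\n" "$dg" "$q" "$asm_disk"
}
```

For example, with the hypothetical names DATA, rac1 and DISK01, `asm_online_disk_cmd DATA rac1 DISK01` prints `asmcmd online -G DATA -D 'rac1$DISK01'`.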
Determining the cause of the problem
Several types of failure can cause ASM to take disks offline. The table below, together with the log files, can help determine the exact cause of the problem.
Symptoms | Possible causes |
---|---|
A single disk is offline | |
Multiple disks belonging to the same node are offline, but some disks on the same node are online | |
All disks belonging to the same node are offline | |
Multiple disks belonging to 2 or more different nodes are offline | |
Notes:
- A single disk going offline may be caused by a transient error that will not repeat. However, if the same disk is taken offline more than once, the disk may be damaged and may need to be replaced.
- The most common causes of node instability are connection storms and server hardware problems. Other possible causes include heavy swapping, out-of-memory conditions, and excessive CPU load; these are less likely on a properly configured system.
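The symptom rows in the table above reduce to a simple decision: how many nodes have offline disks, and whether all or only some of a node's disks are offline. The following is a minimal bash (4+) sketch of that decision. It assumes you have written out one line per disk in a hypothetical `NODE DISK STATUS` format yourself; this is not the output format of flashgrid-dg or asmcmd. When a node has only one disk and it is offline, both the "single disk" and "all disks" rows match; this sketch arbitrarily reports the "all disks" row in that case.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map an offline-disk pattern to a symptom row of the
# table. Reads stdin lines in an assumed "NODE DISK STATUS" format (prepared
# by hand; NOT the output format of flashgrid-dg or asmcmd).
classify_offline() {
  declare -A total=() offline=()   # per-node disk counts (local to function)
  local node disk status
  while read -r node disk status; do
    [ -z "$node" ] && continue     # skip blank lines
    total[$node]=$(( ${total[$node]:-0} + 1 ))
    if [ "$status" = "OFFLINE" ]; then
      offline[$node]=$(( ${offline[$node]:-0} + 1 ))
    fi
  done
  local n_nodes=${#offline[@]}     # number of nodes with any offline disk
  if [ "$n_nodes" -eq 0 ]; then
    echo "No disks are offline"
  elif [ "$n_nodes" -ge 2 ]; then
    echo "Multiple disks belonging to 2 or more different nodes are offline"
  else
    # Exactly one node has offline disks; compare with its total disk count.
    for node in "${!offline[@]}"; do
      if [ "${offline[$node]}" -eq "${total[$node]}" ]; then
        echo "All disks belonging to the same node are offline"
      elif [ "${offline[$node]}" -eq 1 ]; then
        echo "A single disk is offline"
      else
        echo "Multiple disks belonging to the same node are offline, but some disks on the same node are online"
      fi
    done
  fi
}
```

For example, `printf 'rac1 d1 ONLINE\nrac1 d2 OFFLINE\n' | classify_offline` reports the "A single disk is offline" row. The result only narrows down which table row applies; the log files are still needed to confirm the cause.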