ASM may take disks offline because of long latencies. FlashGrid services detect when the failed disks become available again and attempt to bring them back online with the ASM ONLINE operation. However, with the default settings, these transient offline events can generate excessive alerts in Oracle Enterprise Manager (OEM). This article describes the recommended configuration of the default OEM metrics related to offline disks.
Default configuration
Two metrics can trigger alerts when a disk goes offline: Disk Status and Offline Disk Count.
By default, both metrics are collected every 15 minutes and trigger an alert on the first occurrence of the event. FlashGrid services, however, are expected to bring the failed disks back online within 10 minutes after a transient error. As a result, the default OEM settings can raise alerts even when FlashGrid services resolve the issue on their own (e.g. when a metric collection happens to fall within the up-to-10-minute period while the disk is still offline).
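The timing argument above can be checked with a little arithmetic. The following is a minimal sketch, not part of OEM or FlashGrid: it assumes collections happen at exact multiples of 15 minutes, that an outage lasts the full 10 minutes, and that time is counted in whole minutes. The function name collections_during_outage is hypothetical.

```python
COLLECTION_INTERVAL_MIN = 15  # default OEM collection interval (from the article)
RECOVERY_MIN = 10             # expected upper bound on FlashGrid recovery (from the article)


def collections_during_outage(start_min, duration_min,
                              interval_min=COLLECTION_INTERVAL_MIN):
    """Count collections (at multiples of interval_min) that fall inside
    the half-open outage window [start_min, start_min + duration_min)."""
    end = start_min + duration_min
    # First collection time at or after the start of the outage (ceiling division).
    first = -(-start_min // interval_min) * interval_min
    if first >= end:
        return 0
    return (end - 1 - first) // interval_min + 1


# Try every possible outage start offset within one collection cycle.
counts = [collections_during_outage(s, RECOVERY_MIN)
          for s in range(COLLECTION_INTERVAL_MIN)]

seen_once = sum(1 for c in counts if c >= 1)
print(f"{seen_once} of {len(counts)} start offsets are seen by a collection")
print(f"max collections seeing one outage: {max(counts)}")
```

Under these assumptions, 10 of the 15 possible start offsets (about two thirds) are caught by a collection and would alert on the first occurrence, while no single 10-minute outage is ever seen by more than one collection.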
Recommended configuration
It is advised to trigger an alert only when the issue is not resolved automatically. To achieve this, customize the metric settings so that the alert status is set only after the 2nd occurrence of the event.
Go to the Cluster ASM target -> Monitoring -> Metric and Collection Settings
Select the Disk Status metric and click on Edit
Click Edit on Monitored Objects
Set the Number of Occurrences to 2 and click Continue
Click Continue. Similarly, set the Number of Occurrences to 2 for the Offline Disk Count metric, then click OK to save the changes to the repository.
If a monitoring template is in place, set the Number of Occurrences to 2 in the template in the same way.
This way, intermittent disk errors that FlashGrid services resolve automatically will not trigger OEM alerts.
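The effect of the Number of Occurrences setting can be illustrated with a small model. This is only a sketch under the assumption that OEM counts occurrences over consecutive collections; the function first_alert_index and the sample sequences are hypothetical and do not call any OEM API.

```python
def first_alert_index(samples, occurrences):
    """Return the index of the collection at which an alert would fire,
    or None if it never fires.

    samples     -- one boolean per collection, True if a disk was offline
    occurrences -- number of consecutive offline collections required
    """
    streak = 0
    for i, offline in enumerate(samples):
        # Any healthy collection resets the streak.
        streak = streak + 1 if offline else 0
        if streak >= occurrences:
            return i
    return None


# Transient error: the disk is offline at one collection only and is back
# online (via FlashGrid services) before the next one.
transient = [False, True, False, False]

# Persistent error: the disk stays offline across collections.
persistent = [False, True, True, True]

print(first_alert_index(transient, 1))   # default setting: alert fires
print(first_alert_index(transient, 2))   # recommended setting: no alert
print(first_alert_index(persistent, 2))  # recommended setting: alert on 2nd occurrence
```

With a 15-minute collection interval, a persistent failure is still reported under the recommended setting, one collection cycle later than with the default.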