This is an update to the technical alert published on 2021-07-14.
All customers are advised to apply the software update.
Summary:
The FlashGrid software update must be applied to prevent incorrect handling of certain failure scenarios and the database downtime that can result.
Symptoms:
Certain failure scenarios may result in cluster-wide database downtime.
Affected products:
- FlashGrid Cluster on AWS
- FlashGrid Cluster on Azure
- FlashGrid Cluster on GCP
Affected versions and configurations:
FlashGrid Cluster software version 21.08 or earlier.
To determine the software version currently in use, run: rpm -qa | grep flashgrid-skycluster
Note that version 21.08 is listed as 21.8.
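The version check can be scripted. The sketch below is not a FlashGrid tool. It uses the standard rpm options `-q --quiet` and `--qf '%{VERSION}\n'` to read the version of the flashgrid-skycluster package and flags anything older than 22.03, the minimum version named under Resolution. It assumes that the rpm VERSION field begins with MAJOR.MINOR and omits the leading zero of MINOR (21.8 for 21.08). The helper names `to_number` and `needs_update` are hypothetical.

```shell
#!/bin/sh
# Sketch: check whether the installed FlashGrid Cluster version predates 22.03.
# Assumption: the rpm VERSION field of flashgrid-skycluster starts with
# MAJOR.MINOR and drops the leading zero of MINOR (21.8 means 21.08).

# to_number VERSION: print MAJOR*100 + MINOR, e.g. "21.8" -> 2108
to_number() {
  local major rest minor
  major=${1%%.*}
  case $1 in
    *.*) rest=${1#*.}; minor=${rest%%.*} ;;
    *)   minor=0 ;;
  esac
  minor=${minor#0}     # "08" -> "8", so the shell does not treat it as octal
  minor=${minor:-0}    # "0" became empty above; restore it
  echo $(( $major * 100 + $minor ))
}

# needs_update VERSION: exit status 0 if VERSION is older than 22.03
needs_update() {
  [ "$(to_number "$1")" -lt "$(to_number 22.03)" ]
}

if command -v rpm >/dev/null 2>&1 && rpm -q --quiet flashgrid-skycluster; then
  installed=$(rpm -q --qf '%{VERSION}\n' flashgrid-skycluster | head -n 1)
  if needs_update "$installed"; then
    echo "flashgrid-skycluster $installed: update to 22.03 or newer"
  else
    echo "flashgrid-skycluster $installed: 22.03 or newer is installed"
  fi
else
  echo "flashgrid-skycluster package not found"
fi
```

Normalizing both forms to an integer (21.8 and 21.08 both become 2108) avoids comparing the rpm output as plain strings, where "21.11" would sort before "21.8".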
Root cause:
Certain failure scenarios are not handled correctly.
Resolution:
FlashGrid Cluster software version 22.03 includes the following improvements:
- Removed the initiator check's dependency on the flashgrid_aio service. This dependency could trigger initiator disconnects when the flashgrid_aio service is blocked.
- Changed the logging mechanism in FlashGrid Storage Fabric services from synchronous to asynchronous mode, which prevents service failures when a system disk becomes inaccessible.
- Changed the nvme io_timeout parameter from 30 seconds to 11 seconds on RHEL/OL 7, which prevents excessive I/O delays on AWS EC2 Nitro-generation systems (version 21.11 had already changed this parameter from an unlimited wait to 30 seconds).
- Added a warning about known reliability issues when a RHEL/OL 7 kernel version below 3.10.0-1160.49.1.el7 is used.
- Changed the initiator noop timeout from 35 to 15 seconds on Azure-based clusters, which improves failure handling by Oracle Clusterware under certain conditions (also included in version 21.06).
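To confirm which NVMe I/O timeout a RHEL/OL 7 node is actually using, the value can be read from sysfs. This is a sketch, not part of the FlashGrid update. It assumes the parameter is exposed at one of the standard Linux locations /sys/module/nvme_core/parameters/io_timeout or /sys/module/nvme/parameters/io_timeout (which one exists depends on the kernel), and it treats 11 seconds as the expected value per the list above. The helper `io_timeout_status` is hypothetical.

```shell
#!/bin/sh
# Sketch: report the NVMe I/O timeout used by the running kernel.
# Assumption: the value is exposed at one of the standard sysfs paths below
# (nvme_core on most kernels, nvme on some older ones). 11 seconds is the
# value FlashGrid Cluster 22.03 sets on RHEL/OL 7.
EXPECTED_IO_TIMEOUT=11

# io_timeout_status FILE: print "ok", "mismatch <value>", or "missing"
io_timeout_status() {
  local value
  if [ ! -r "$1" ]; then
    echo missing
    return
  fi
  value=$(cat "$1")
  # Compare as strings: an "unlimited" setting is a very large integer
  if [ "$value" = "$EXPECTED_IO_TIMEOUT" ]; then
    echo ok
  else
    echo "mismatch $value"
  fi
}

for f in /sys/module/nvme_core/parameters/io_timeout \
         /sys/module/nvme/parameters/io_timeout; do
  if [ -r "$f" ]; then
    echo "$f: $(io_timeout_status "$f")"
  fi
done
```

A "mismatch 30" result on an updated RHEL/OL 7 node would suggest that the 22.03 setting is not yet in effect, for example because the node has not been rebooted since the update. That interpretation is an assumption, so check it against the FlashGrid update instructions.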
Update the FlashGrid Cluster software using the flashgrid_node_update package, version 22.03 or newer. See the update instructions here.
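The 22.03 release warns when a RHEL/OL 7 kernel older than 3.10.0-1160.49.1.el7 is in use. A comparable local check can be sketched as follows. This is not FlashGrid's implementation. It assumes that comparing the leading all-digit fields of the kernel string, split on "." and "-", matches rpm ordering for el7 kernels (so 3.10.0-1160.el7 is older than 3.10.0-1160.49.1.el7). The helper `kernel_older_than` is hypothetical.

```shell
#!/bin/sh
# Sketch: warn if the running RHEL/OL 7 kernel is older than the version
# named in the FlashGrid 22.03 notes. This is NOT FlashGrid's own check.
# Assumption: comparing the leading numeric fields of the kernel string
# (split on "." and "-") matches rpm ordering for el7 kernels.
MIN_KERNEL=3.10.0-1160.49.1.el7

# kernel_older_than A B: exit status 0 if kernel A is older than kernel B
kernel_older_than() {
  awk -v a="$1" -v b="$2" '
    # Store the leading all-digit fields of s in out[]; return their count
    function fields(s, out,    parts, n, i) {
      n = split(s, parts, "[.-]")
      for (i = 1; i <= n && parts[i] ~ /^[0-9]+$/; i++) out[i] = parts[i] + 0
      return i - 1
    }
    BEGIN {
      split("", x); split("", y)
      na = fields(a, x); nb = fields(b, y)
      n = (na > nb) ? na : nb
      for (i = 1; i <= n; i++) {
        xi = (i <= na) ? x[i] : 0   # missing fields count as 0
        yi = (i <= nb) ? y[i] : 0
        if (xi < yi) exit 0
        if (xi > yi) exit 1
      }
      exit 1                        # equal kernels are not older
    }'
}

current=$(uname -r)
case $current in
  *.el7*)
    if kernel_older_than "$current" "$MIN_KERNEL"; then
      echo "WARNING: kernel $current is older than $MIN_KERNEL"
    else
      echo "kernel $current is $MIN_KERNEL or newer"
    fi ;;
  *)
    echo "kernel $current is not an el7 kernel; check not applicable" ;;
esac
```

A plain `sort -V` is avoided here because it orders 3.10.0-1160.el7 after 3.10.0-1160.49.1.el7, which is the opposite of rpm's ordering.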