Note: This article is for Health Checker versions 20.02 and 20.04 only. For newer versions see this article.
The FlashGrid Health Checker tool checks multiple points across storage, network, OS, and other components to identify errors, misconfigurations, and risk items. The tool can be run on any node of the cluster and performs its checks on all nodes in the cluster. Running the tool is non-disruptive and can be done on a live cluster.
# flashgrid-health-check -h
usage: flashgrid-health-check [-h] [--version] [command] ...

FlashGrid HealthCheck CLI

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit

Commands:
  [command]             Default: show
    show                Show cluster status
    reset-rpm-list      Reset rpm list
    reset-cfg-list      Reset list of cfg files
    reset-services-list Reset services list
FlashGrid Health Checker performs the following checks:
- ASM DiskGroup status - checks mount status, redundancy, total and free MB, offline and lost disks, resync and readlocal status, voting file info
- Alerts in Storage Fabric logs in the last 7 days - checks alerts in log files under /opt/flashgrid/log and /opt/flashgrid-diags/log directories
- Available memory - checks that available memory is >20%: $ cat /proc/meminfo | grep Available
- Check db memory settings - shows database memory-related parameters, such as memory_max_target, memory_target, sga_max_size, pga_aggregate_target, and pga_aggregate_limit (db v12.1 or higher). Also checks the total database memory allocation across all databases.
- Check local_listener for each db - checks that LOCAL_LISTENER = 'NodeFQDN'
- Check tnsnames.ora - confirms correct (unmodified) entries in tnsnames.ora for DONOTDELETE,NODEFQDN alias.
- Flashgrid CLAN check - verifies status of the flashgrid-clan service
- Free system disk space - checks that free disk space is >30% on the / and /u01 (if it exists) mounts on all nodes
- Kernel taint check - checks for suspicious errors/warnings (Oops, "process stuck for 120 seconds", etc.) in various logs (during the last week or since the last reboot)
- SF node status - checks if flashgrid-node status is good
- Storage Fabric cluster verification status - shows status of the flashgrid-cluster verify command
- Swap disabled - checks if swap is disabled
- System config file modifications - detects changes in critical cfg files after install or last boot. The following command resets the list of cfg files:
  # flashgrid-health-check reset-cfg-list
- System services – checks and lists failed system services
- Unexpected or 3rd party RPMs installed - detects non-standard RPMs installed after the tool was installed or the list was reset. The following command regenerates the list of installed RPMs:
  # flashgrid-health-check reset-rpm-list
- Unexpected or 3rd party services enabled - detects non-standard services enabled after the tool was installed or the list was reset. The following command regenerates the list of enabled services:
  # flashgrid-health-check reset-services-list
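To make the "Available memory" threshold in the list above concrete, here is a minimal shell sketch. It assumes the percentage is MemAvailable divided by MemTotal from /proc/meminfo; the article does not state how the tool derives the figure, and the helper names avail_mem_pct and avail_mem_status are hypothetical, not part of the Health Checker.

```shell
# Sketch only: avail_mem_pct and avail_mem_status are hypothetical helpers.
# Assumption: percentage = MemAvailable / MemTotal from a meminfo-style file.

# Prints the available-memory percentage with one decimal place.
# $1: path to a file in /proc/meminfo format (defaults to /proc/meminfo)
avail_mem_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%.1f\n", avail * 100 / total }' "${1:-/proc/meminfo}"
}

# Prints OK when available memory is above the 20% threshold, WARNING otherwise.
avail_mem_status() {
    pct=$(avail_mem_pct "$1")
    # awk does the floating-point comparison that plain sh cannot
    awk -v p="$pct" 'BEGIN { if (p > 20) print "OK"; else print "WARNING" }'
}
```

On a live node, running avail_mem_status with no argument reads /proc/meminfo directly. This only approximates one check on one node; the tool itself reports the result for every node in the cluster.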
Sample report from a two-node cluster (database nodes rac1 and rac2, plus a third node, racq):
# flashgrid-health-check
HealthCheck 20.4.33.71909 #4077668bfae738b12c9ddc900f2262693c85c566
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Check: ASM DiskGroup status
rac1: WARNING
---------------------------------------------------------------------------------------------------------
GroupName  Status   Mounted   Type    TotalMiB  FreeMiB  OfflineDisks  LostDisks  Resync  ReadLocal  Vote
---------------------------------------------------------------------------------------------------------
DATA       Good     AllNodes  NORMAL  40960     27224    0             0          No      Enabled    None
FRA        Warning  AllNodes  NORMAL  30720     30381    0             0          Yes     Enabled    None
GRID       Good     AllNodes  NORMAL  10240     9288     0             0          No      Enabled    3/3
---------------------------------------------------------------------------------------------------------
rac2: WARNING
---------------------------------------------------------------------------------------------------------
GroupName  Status   Mounted   Type    TotalMiB  FreeMiB  OfflineDisks  LostDisks  Resync  ReadLocal  Vote
---------------------------------------------------------------------------------------------------------
DATA       Good     AllNodes  NORMAL  40960     27224    0             0          No      Enabled    None
FRA        Warning  AllNodes  NORMAL  30720     30381    0             0          Yes     Enabled    None
GRID       Good     AllNodes  NORMAL  10240     9288     0             0          No      Enabled    3/3
---------------------------------------------------------------------------------------------------------

Check: Alerts in Storage Fabric logs in the last 7 days
rac1: WARNING : /opt/flashgrid/log/fg-cluster-error.log: 53 alerts
rac2: WARNING : /opt/flashgrid/log/fg-cluster-error.log: 54 alerts
racq: WARNING : /opt/flashgrid/log/fg-cluster-error.log: 53 alerts

Check: Available memory
rac1: WARNING : avail mem: 15.4%
rac2: OK : avail mem: 27.7%
racq: OK : avail mem: 75.5%

Check: Check db memory settings
rac1: WARNING
      All DBs: sum(pga_aggregate_limit) + max(HugePages, sum(sga_max_size)) >= TotalMemory - 12 GiB
      : sum(pga_aggregate_limit) = 4 GiB
      : HugePages = 17 GiB
      : sum(sga_max_size) = 0 GiB
      : TotalMemory = 31 GiB
rac2: WARNING : Failed to query the db instance 'orcl2'. Check that it is running.

Check: Check local_listener for each db
rac1: OK
rac2: WARNING : Failed to query the db instance 'orcl2'. Check that it is running.

Check: Check tnsnames.ora
rac1: OK : Warning: Multiple listener endpoints detected. Skipping tnsnames.ora check.
rac2: OK : Warning: Multiple listener endpoints detected. Skipping tnsnames.ora check.

Check: Flashgrid CLAN check
rac1: OK
rac2: OK
racq: OK

Check: Free system disk space
rac1: OK : /u01: avail 59%, /: avail 85%
rac2: OK : /u01: avail 60%, /: avail 89%
racq: OK : /: avail 89%

Check: Kernel taint check
rac1: OK
rac2: OK
racq: OK

Check: SF node status
rac1: OK
rac2: OK
racq: OK

Check: Storage Fabric cluster verification status
rac1: OK
rac2: OK
racq: OK

Check: Swap disabled
rac1: OK : Swap disabled
rac2: OK : Swap disabled
racq: OK : Swap disabled

Check: System config file modifications
rac1: WARNING
      Checksum file not found, using fg_setup.log modification time instead.
      /etc/dnsmasq.conf modified since install
rac2: WARNING
      Checksum file not found, using fg_setup.log modification time instead.
      /etc/sysconfig/iptables modified since install
racq: WARNING
      Checksum file not found, using fg_setup.log modification time instead.
      /etc/ssh/sshd_config modified since install

Check: System services
rac1: OK
rac2: OK
racq: OK

Check: Unexpected or 3rd party RPMs installed
rac1: OK
rac2: OK
racq: WARNING : telnet

Check: Unexpected or 3rd party services enabled
rac1: OK
rac2: OK
racq: OK
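
The "Check db memory settings" warning for rac1 in the sample report can be followed step by step with a little arithmetic. The sketch below assumes the condition printed in the report is the one that triggers the warning and that all values are whole GiB; db_mem_check is a hypothetical helper written for illustration, not a command shipped with the tool.

```shell
# Sketch only: db_mem_check is a hypothetical helper. All values are in GiB.
# Assumption: the condition printed in the report is what triggers the WARNING;
# the 12 GiB margin is taken from the report text.
db_mem_check() {
    pga_limit_sum=$1   # sum(pga_aggregate_limit) across all databases
    hugepages=$2       # memory reserved as HugePages
    sga_max_sum=$3     # sum(sga_max_size) across all databases
    total_mem=$4       # total memory on the node

    # max(HugePages, sum(sga_max_size))
    if [ "$hugepages" -ge "$sga_max_sum" ]; then
        sga_or_huge=$hugepages
    else
        sga_or_huge=$sga_max_sum
    fi

    allocated=$((pga_limit_sum + sga_or_huge))
    limit=$((total_mem - 12))

    if [ "$allocated" -ge "$limit" ]; then
        echo "WARNING: $allocated GiB >= $limit GiB"
    else
        echo "OK: $allocated GiB < $limit GiB"
    fi
}

# Values reported for rac1 in the sample report
db_mem_check 4 17 0 31
# prints: WARNING: 21 GiB >= 19 GiB
```

With the rac1 values, 4 + max(17, 0) = 21 GiB, which is not below 31 - 12 = 19 GiB, so the condition holds and the check reports a warning.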