Note: This document describes the latest FlashGrid Health Checker tool. Information about older versions 20.04 or 20.02 is here.
The FlashGrid Health Checker tool performs comprehensive checks across storage, network, OS, and other components to identify errors, misconfigurations, and risk items. The tool can be executed on any system and checks either a single FlashGrid server instance or all nodes of a FlashGrid cluster. The tool is non-disruptive and can be executed on a live system.
# flashgrid-health-check -h
usage: flashgrid-health-check [-h] [--version] [command] ...
FlashGrid HealthCheck CLI
optional arguments:
  -h, --help              show this help message and exit
  --version               show program's version number and exit
Commands:
  [command]               Default: show
    show                  Show cluster status
    reset-rpm-list        Reset rpm list
    reset-cfg-list        Reset list of cfg files
    reset-services-list   Reset services list
    reset-local-cfg-file  Reset file checksum
Health Checker performs the following checks:
- ASM DiskGroup failgroup (cluster only) - checks that the number of failure groups for each disk group is not lower than the recommended value per the list below.
  - NORMAL redundancy: 3
  - HIGH redundancy: 5
- ASM DiskGroup repair_time (cluster only) - checks that the `disk_repair_time` and `failgroup_repair_time` attributes are greater than or equal to 2400h.
- ASM DiskGroup status - checks mount status, redundancy, total and free MB, offline and lost disks, resync and read-local status, and voting file info.
- Alerts in Storage Fabric logs in the last 7 days - checks alerts in log files under /opt/flashgrid/log and /opt/flashgrid-diags/log directories.
- Available memory - checks that available memory is >20%, as reported by `cat /proc/meminfo | grep Available`.
- Azure MTU size (cluster only) - verifies that the MTU size is 3920.
- Azure disk type - verifies that the disks are Premium SSD or Premium SSD v2
- AZ assignment (cluster only) - checks that a cluster spanning multiple availability zones (AZ) can handle a single AZ failure.
- Check db memory settings - shows database memory-related parameters, such as `memory_max_target`, `memory_target`, `sga_max_size`, `pga_aggregate_target`, and `pga_aggregate_limit` (db v12.1 or higher). Also checks the total memory allocation across all databases.
- Check local_listener for each db (cluster only) - checks that database parameter LOCAL_LISTENER = 'NodeFQDN'.
- Check tnsnames.ora (cluster only) - confirms correct (unmodified) entries in tnsnames.ora for DONOTDELETE,NODEFQDN alias.
- Check use_large_pages for each db - checks that database parameter USE_LARGE_PAGES is set to ONLY.
- Email address(es) for sending alerts - checks that at least one non-local email address is configured for email alerts.
- FlashGrid CLAN check – verifies the status of flashgrid-clan service.
- Free system disk space - checks that free disk space is >20% on the / and /u01 (if it exists) mounts.
- HugePages - checks that HugePages are enabled and that the actual value matches the expected value configured with FlashGrid.
- Kernel taint check – checks for suspicious errors/warnings (Oops, "process stuck for 120 seconds", etc.) in various logs over the last week or since the last reboot.
- Mount points check - checks that the mount options in `/etc/fstab` are valid.
- Multipath blacklist check - checks that `multipathd.service` is stopped and that `/etc/multipath.conf` exists with all devices blacklisted.
- NTP time sources - checks the following for chrony:
  - No unusable sources. Warning if any `x~?` sources are present. Peers (`=`) are not counted.
  - The number of usable sources must be != 2 (only `*+-` sources are counted). Warning if = 2. Peers (`=`) are not counted.
- New or 3rd party RPMs installed - verifies whether non-standard RPMs were installed after the tool was installed or the list was reset.
  - The `flashgrid-health-check reset-rpm-list` command regenerates the list of installed RPMs.
- New or 3rd party services enabled - checks for non-standard services enabled after the tool was installed or the list was reset.
  - The `flashgrid-health-check reset-services-list` command regenerates the list of enabled services.
- Reserved memory - checks that the reserved kernel memory configured by default in FlashGrid environments is not manually changed in `/etc/sysctl.conf`.
- SF node status – checks if `flashgrid-node` status is good.
- Storage Fabric verification status – shows the status of the `flashgrid-cluster verify` command.
- Swap disabled – checks if swap is disabled.
- System config file modifications – detects changes in critical cfg files after installation or last boot.
  - The `flashgrid-health-check reset-cfg-list` command resets the list of cfg files.
- System services – checks and lists failed system services.
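Two of the rules in the list above, available memory and NTP time sources, are concrete enough to sketch in shell. This is a minimal illustration, not the tool's actual implementation. The function names `available_mem_percent` and `ntp_sources_status` are hypothetical, the memory percentage is assumed to be MemAvailable relative to MemTotal, and the NTP rule is assumed to apply to `chronyc sources` output, where the first two characters of each source line are the mode and the state.

```shell
#!/usr/bin/env bash
# Illustrative sketches only; the function names are hypothetical and this
# is not how flashgrid-health-check itself is implemented.

# Print MemAvailable as a percentage of MemTotal (assumed definition of
# "available memory"). Optional argument: a file in /proc/meminfo format.
available_mem_percent() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total <= 0) exit 1
      printf "%.1f\n", 100 * avail / total
    }
  ' "${1:-/proc/meminfo}"
}

# Read `chronyc sources` output on stdin and apply the two NTP rules:
# WARNING if any source is in state x, ~ or ?, or if exactly two sources
# are in state *, + or -. Peers (mode "=") are skipped.
ntp_sources_status() {
  awk '
    /^[=#^][*+?x~-]/ {
      mode  = substr($0, 1, 1)
      state = substr($0, 2, 1)
      if (mode == "=") next
      if (index("x~?", state) > 0) unusable++
      else usable++
    }
    END {
      status = (unusable > 0 || usable == 2) ? "WARNING" : "OK"
      printf "%s usable=%d unusable=%d\n", status, usable, unusable
    }
  '
}

# Usage (on a node where chrony is running):
#   available_mem_percent
#   chronyc sources | ntp_sources_status
```

A value below 20 from `available_mem_percent` would correspond to the warning threshold described in the list, under the assumption stated above.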
Sample report from a two-node cluster (rac1, rac2) with a quorum node (racq):
# flashgrid-health-check
HealthCheck 23.6.37.55799 #49157d14b36057fcdf63ed22b559561c903bf3e1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Check: ASM DiskGroup failgroup
rac1: OK
rac2: OK
Check: ASM DiskGroup repair_time
rac1: OK
rac2: OK
Check: ASM DiskGroup status
rac1: WARNING
---------------------------------------------------------------------------------------------------------
GroupName  Status   Mounted   Type    TotalMiB  FreeMiB  OfflineDisks  LostDisks  Resync  ReadLocal  Vote
---------------------------------------------------------------------------------------------------------
DATA       Good     AllNodes  NORMAL  40960     27224    0             0          No      Enabled    None
FRA        Warning  AllNodes  NORMAL  30720     30381    0             0          Yes     Enabled    None
GRID       Good     AllNodes  NORMAL  10240     9288     0             0          No      Enabled    3/3
---------------------------------------------------------------------------------------------------------
rac2: WARNING
---------------------------------------------------------------------------------------------------------
GroupName  Status   Mounted   Type    TotalMiB  FreeMiB  OfflineDisks  LostDisks  Resync  ReadLocal  Vote
---------------------------------------------------------------------------------------------------------
DATA       Good     AllNodes  NORMAL  40960     27224    0             0          No      Enabled    None
FRA        Warning  AllNodes  NORMAL  30720     30381    0             0          Yes     Enabled    None
GRID       Good     AllNodes  NORMAL  10240     9288     0             0          No      Enabled    3/3
---------------------------------------------------------------------------------------------------------
Check: Alerts in Storage Fabric logs in the last 7 days
rac1: WARNING : /opt/flashgrid/log/fg-cluster-error.log: 53 alerts
rac2: WARNING : /opt/flashgrid/log/fg-cluster-error.log: 54 alerts
racq: WARNING : /opt/flashgrid/log/fg-cluster-error.log: 53 alerts
Check: Available memory
rac1: WARNING : avail mem: 15.4%
rac2: OK : avail mem: 27.7%
racq: OK : avail mem: 75.5%
Check: AZ assignment
rac1: OK
rac2: OK
racq: OK
Check: Check db memory settings
rac1: WARNING
All DBs: sum(pga_aggregate_limit) + max(HugePages, sum(sga_max_size)) >= TotalMemory - 12 GiB
: sum(pga_aggregate_limit) = 4 GiB
: HugePages = 17 GiB
: sum(sga_max_size) = 0 GiB
: TotalMemory = 31 GiB
rac2: WARNING : Failed to query the db instance 'orcl2'. Check that it is running.
Check: Check local_listener for each db
rac1: OK
rac2: WARNING : Failed to query the db instance 'orcl2'. Check that it is running.
Check: Check tnsnames.ora
rac1: OK : Warning: Multiple listener endpoints detected. Skipping tnsnames.ora check.
rac2: OK : Warning: Multiple listener endpoints detected. Skipping tnsnames.ora check.
Check: Check use_large_pages for each db
rac1: WARNING
Use database orcl1
DB instance orcl1 has USE_LARGE_PAGES != ONLY
rac2: WARNING
Use database orcl2
DB instance orcl2 has USE_LARGE_PAGES != ONLY
Check: Email address(es) for sending alerts
rac1: OK
rac2: OK
racq: OK
Check: Flashgrid CLAN check
rac1: OK
rac2: OK
racq: OK
Check: Free system disk space
rac1: OK : /u01: avail 59%, /: avail 85%
rac2: OK : /u01: avail 60%, /: avail 89%
racq: OK : /: avail 89%
Check: HugePages
rac1: OK
rac2: OK
Check: Kernel taint check
rac1: OK
rac2: OK
racq: OK
Check: Mount points check
rac1: OK
rac2: OK
racq: OK
Check: Multipath blacklist check
rac1: OK
rac2: OK
racq: OK
Check: NTP time sources
rac1: OK
rac2: OK
racq: OK
Check: New or 3rd party RPMs installed
rac1: OK
rac2: OK
racq: OK
Check: New or 3rd party services enabled
rac1: OK
rac2: OK
racq: OK
Check: Reserved memory
rac1: OK
rac2: OK
Check: SF node status
rac1: OK
rac2: OK
racq: OK
Check: Storage Fabric cluster verification status
rac1: OK
rac2: OK
racq: OK
Check: Swap disabled
rac1: OK : Swap disabled
rac2: OK : Swap disabled
racq: OK : Swap disabled
Check: System config file modifications
rac1: WARNING
Checksum file not found, using fg_setup.log modification time instead.
/etc/dnsmasq.conf modified since install
rac2: WARNING
Checksum file not found, using fg_setup.log modification time instead.
/etc/sysconfig/iptables modified since install
racq: WARNING
Checksum file not found, using fg_setup.log modification time instead.
/etc/ssh/sshd_config modified since install
Check: System services
rac1: OK
rac2: OK
racq: OK
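The "Check db memory settings" warning for rac1 in the report above appears to follow from the inequality it prints: 4 GiB + max(17 GiB, 0 GiB) = 21 GiB, which is not below 31 GiB - 12 GiB = 19 GiB. A minimal shell sketch of that arithmetic follows. It assumes whole-GiB inputs and that the printed inequality is the warning condition. The function name `db_memory_status` is hypothetical and this is not the tool's own code.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the inequality printed in the sample report:
#   sum(pga_aggregate_limit) + max(HugePages, sum(sga_max_size))
#       >= TotalMemory - 12 GiB   ->  WARNING
# All four arguments are whole GiB values.
db_memory_status() {
  local pga_limit_sum=$1 hugepages=$2 sga_max_sum=$3 total_mem=$4
  local headroom=12   # GiB subtracted from total memory, as shown in the report
  local larger=$(( hugepages > sga_max_sum ? hugepages : sga_max_sum ))
  if (( pga_limit_sum + larger >= total_mem - headroom )); then
    echo "WARNING"
  else
    echo "OK"
  fi
}

# Values from rac1 in the sample report:
#   db_memory_status 4 17 0 31
```

With the rac1 values, `db_memory_status 4 17 0 31` prints WARNING, matching the report.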