The FlashGrid Node Monitor service is part of the flashgrid-diags package. It monitors system health indicators, including CPU utilization, available memory, clocks, and free file system space. The default parameters are suitable for most environments, but they can be customized by overriding the default values.
Note: No alert is generated when a VM is gracefully stopped or rebooted. To monitor VM state changes, configure notifications in your cloud provider's portal. See the corresponding documentation for your cloud:
- Azure: Azure Resource Health
- AWS: Amazon SNS and EventBridge
- GCP: Google Cloud Monitoring
Prerequisites
To customize monitoring parameters, perform the following steps (in a cluster, repeat them on each node):
- Add or modify the required parameters in the following file. Create the file if it does not exist:
/etc/flashgrid-diags.cfg
- Restart the flashgrid-node-monitor service:
$ sudo systemctl restart flashgrid-node-monitor
- Confirm that the service restarted successfully:
$ sudo systemctl status flashgrid-node-monitor
Example of increasing the CPU monitoring interval to 300 seconds:
[cpu_monitor]
check_interval = 300.0 # s
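The same override could also be generated programmatically, for example when the change has to be rolled out to many nodes. The sketch below assumes the file uses INI-style syntax that Python's configparser can read (the section headers and key = value lines suggest this, but it is not confirmed here). apply_override is a hypothetical helper, not part of flashgrid-diags. Rewriting the text through configparser drops inline comments such as "# s".

```python
import configparser
import io


def apply_override(cfg_text, section, key, value):
    """Return cfg_text with section/key set to value (hypothetical helper)."""
    # Assumption: the file is INI-like and '#' starts an inline comment.
    parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
    parser.read_string(cfg_text)
    if not parser.has_section(section):
        parser.add_section(section)
    parser.set(section, key, str(value))
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Start from an empty (or not yet existing) override file.
new_text = apply_override("", "cpu_monitor", "check_interval", 300.0)
print(new_text)
```

The returned text would still have to be written to /etc/flashgrid-diags.cfg with root privileges, followed by the service restart described above.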
CPU monitoring parameters and default values
alerts - enable or disable email alerts
enable - switch monitoring and logging of CPU usage on or off
check_interval - CPU utilization measurement interval in seconds. Utilization is averaged over this interval, so increasing it makes the monitor less sensitive to short spikes in CPU utilization.
max_usage - CPU utilization threshold, as a percentage of all CPUs, that triggers alerts and/or warnings in the log.
[cpu_monitor]
alerts = yes
enable = yes
check_interval = 20.0 # s
max_usage = 80 # % of all CPUs
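To see why a longer check_interval is less sensitive to short spikes, consider the averaging described above. This is a minimal sketch of the arithmetic only, not the monitor's actual implementation. The one-sample-per-second input and the strict "greater than" comparison are assumptions made for illustration.

```python
def average_usage(samples):
    """Mean CPU utilization (%) over one check interval."""
    return sum(samples) / len(samples)


def exceeds_threshold(samples, max_usage=80):
    # Assumption: only a value strictly above max_usage counts as exceeding it.
    return average_usage(samples) > max_usage


# One sample per second: an 18 s spike at 100% on top of a 40% baseline.
spike = [100] * 18
short_interval = spike + [40] * 2      # 20 s interval
long_interval = spike + [40] * 282     # 300 s interval

print(exceeds_threshold(short_interval))  # True  (average is 94%)
print(exceeds_threshold(long_interval))   # False (average is about 43.6%)
```

The same spike crosses the default 80% threshold with a 20-second interval but is averaged away with a 300-second interval.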
Memory monitoring parameters and default values
alerts - enable or disable email alerts
enable - switch monitoring and logging of memory usage on or off
check_interval - how frequently the check is done, in seconds
min_available_pct - minimum available memory threshold, as a percentage of the total system memory
min_available_mb - minimum available memory threshold, in MB
[memory_monitor]
alerts = yes
enable = yes
check_interval = 10.0 # s
min_available_pct = 2 # % of all Memory
min_available_mb = 256 # MB
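When choosing values, it can help to convert the percentage threshold to MB for a given system size and compare it with the absolute threshold. The sketch below does only that arithmetic. How the monitor combines the two thresholds is not stated here, so the hypothetical below_thresholds helper reports them separately rather than guessing.

```python
def pct_threshold_mb(total_mb, min_available_pct=2):
    """Convert the percentage threshold to MB for a given total memory."""
    return total_mb * min_available_pct / 100


def below_thresholds(available_mb, total_mb,
                     min_available_pct=2, min_available_mb=256):
    """Return (below_pct, below_mb), one flag per threshold."""
    below_pct = available_mb < pct_threshold_mb(total_mb, min_available_pct)
    below_mb = available_mb < min_available_mb
    return below_pct, below_mb


total_mb = 64 * 1024                 # a node with 64 GB of memory
print(pct_threshold_mb(total_mb))    # 1310.72
print(below_thresholds(1000, total_mb))
```

On a node of this size, the default 2% threshold corresponds to about 1.3 GB, well above the 256 MB absolute threshold. On a small node with 8 GB, 2% is about 164 MB, so the 256 MB threshold is the higher of the two.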
Clock monitoring parameters and default values
alerts - enable or disable email alerts
enable - switch clock monitoring on or off
check_interval - how frequently the check is done, in seconds
max_err - maximum permitted time warp, in seconds, for each check
[clock_monitor]
alerts = yes
enable = yes
check_interval = 1.0 # s
max_err = 1.0 # s per check
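How the monitor measures time warp is not described here. One common way to detect it, shown below purely as an assumption, is to compare how far the wall clock advanced between two checks with how far a monotonic clock advanced over the same period. The names time_warp and check_clock are hypothetical.

```python
import time


def time_warp(wall_prev, wall_now, mono_prev, mono_now):
    """Seconds by which the wall clock deviated from the monotonic clock."""
    return abs((wall_now - wall_prev) - (mono_now - mono_prev))


def check_clock(prev, max_err=1.0):
    """Take a new (wall, monotonic) reading; return (reading, warp, ok)."""
    now = (time.time(), time.monotonic())
    warp = time_warp(prev[0], now[0], prev[1], now[1])
    return now, warp, warp <= max_err


# Pure example: 1 s passed on the monotonic clock, but the wall clock
# jumped forward by 3.5 s, so the warp is 2.5 s and exceeds max_err = 1.0.
print(time_warp(100.0, 103.5, 50.0, 51.0))  # 2.5
```

In a loop, check_clock would be called once per check_interval, passing the reading returned by the previous call.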
Free file system space monitoring parameters and default values
alerts - enable or disable email alerts
enable - switch free file system space monitoring on or off
check_interval - how frequently the check is done, in seconds
min_fs - list of file systems to monitor and the corresponding minimum free space thresholds, in percent
[fs_monitor]
alerts = yes
enable = yes
check_interval = 3600.0 # s
min_fs = {'/': 20, '/u01': 20} # min % of free space
Note that in a FlashGrid cluster environment, the /u01 file system is usually present on database nodes only. It should not be listed on quorum or storage nodes.
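The min_fs thresholds can be sanity-checked against current usage before they are placed in the configuration file. The sketch below assumes the min_fs value can be read as a Python-style dictionary literal, which is how it appears in the defaults, and uses shutil.disk_usage from the standard library. The helpers parse_min_fs and low_space are hypothetical and do not reflect how the monitor itself is implemented.

```python
import ast
import shutil
from collections import namedtuple


def parse_min_fs(value):
    """Parse the min_fs value; assumes a Python-style dict literal."""
    return ast.literal_eval(value)


def low_space(min_fs, usage=shutil.disk_usage):
    """Return {mount_point: free_pct} for file systems below their threshold."""
    low = {}
    for path, min_pct in min_fs.items():
        stats = usage(path)  # provides .total and .free, in bytes
        free_pct = 100 * stats.free / stats.total
        if free_pct < min_pct:
            low[path] = free_pct
    return low


thresholds = parse_min_fs("{'/': 20, '/u01': 20}")

# Made-up usage figures so the example runs on any machine.
Usage = namedtuple("Usage", "total free")
fake = {"/": Usage(100, 50), "/u01": Usage(100, 10)}
print(low_space(thresholds, usage=fake.__getitem__))  # {'/u01': 10.0}
```

Calling low_space(thresholds) without the usage argument checks the real file systems. The sketch does not handle a listed path that does not exist, so on a node without /u01 that entry should be left out of the dictionary first.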