Customizing Node Monitor Settings

The FlashGrid Node Monitor service is part of the flashgrid-diags package. It monitors system health indicators, including CPU utilization, available memory, and the system clock. The default parameters are suitable for most environments, but they can be customized by overriding the default values.

Note: No alert is generated when a VM is gracefully stopped or rebooted. To monitor VM state changes, configure notification settings in your cloud portal; see the corresponding documentation for your cloud.

Prerequisites

To customize monitoring parameters, perform the following steps (in a cluster, repeat them on each node):

Add or modify the required parameters in the following file. Create it if it does not exist:
```
/etc/flashgrid-diags.cfg
```

Restart the flashgrid-node-monitor service:

$ sudo systemctl restart flashgrid-node-monitor

Confirm that the service restarted successfully:

$ sudo systemctl status flashgrid-node-monitor

Example of increasing the CPU monitoring interval to 300 seconds:

[cpu_monitor]
check_interval = 300.0  # s
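
The effect of an override can be previewed before restarting the service. The following is a minimal sketch, assuming the file uses an INI syntax that Python's standard configparser can read (with trailing `#` comments); that is an assumption for illustration, not a statement about how the Node Monitor itself parses the file. The defaults are copied from this article and the names `DEFAULTS`, `OVERRIDE`, and `effective_settings` are hypothetical.

```python
import configparser

# Defaults copied from this article; the override mirrors the example above.
DEFAULTS = """
[cpu_monitor]
alerts = yes
enable = yes
check_interval = 20.0  # s
max_usage = 80 # % of all CPUs
"""

OVERRIDE = """
[cpu_monitor]
check_interval = 300.0  # s
"""

def effective_settings(defaults_text, override_text):
    """Layer override values on top of the defaults (assumed INI syntax)."""
    # inline_comment_prefixes makes configparser drop trailing "# s" notes.
    parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
    parser.read_string(defaults_text)
    parser.read_string(override_text)  # values read later replace earlier ones
    return parser

cfg = effective_settings(DEFAULTS, OVERRIDE)
print(cfg.getfloat("cpu_monitor", "check_interval"))  # overridden value
print(cfg.getint("cpu_monitor", "max_usage"))         # default kept
print(cfg.getboolean("cpu_monitor", "alerts"))        # default kept
```

Only the parameters listed in the override change; everything else keeps its default, which is why the override file can stay short.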

CPU monitoring parameters and default values

alerts - enable or disable email alerts

enable - switch on/off monitoring and logging of CPU usage

check_interval - CPU utilization measurement interval in seconds. Utilization is averaged over this interval, so increasing it makes the monitor less sensitive to short spikes of CPU utilization.

max_usage - CPU utilization threshold, as a percentage of all CPUs, that will trigger the alerts and/or warnings in the log

[cpu_monitor]
alerts = yes
enable = yes
check_interval = 20.0  # s
max_usage = 80 # % of all CPUs
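
To illustrate why a longer check_interval is less sensitive to short spikes, here is a small sketch that averages a made-up load profile over 20-second and 300-second windows. The load numbers and the helper `window_averages` are hypothetical, and the sketch assumes simple non-overlapping averaging, which may differ from what the monitor does internally.

```python
def window_averages(samples, window):
    """Average consecutive, non-overlapping windows of per-second samples."""
    return [
        sum(samples[i:i + window]) / window
        for i in range(0, len(samples) - window + 1, window)
    ]

# Hypothetical load: 300 s at 40% CPU with one 20 s burst at 100%.
samples = [40.0] * 300
samples[100:120] = [100.0] * 20

MAX_USAGE = 80  # default max_usage from this article

short_avg = window_averages(samples, 20)   # default check_interval
long_avg = window_averages(samples, 300)   # customized check_interval

# The burst fills one whole 20 s window but is diluted over 300 s.
print("20 s windows exceed threshold:", max(short_avg) > MAX_USAGE)
print("300 s window exceeds threshold:", max(long_avg) > MAX_USAGE)
```

With this profile, the 20-second windows cross the 80% threshold during the burst, while the single 300-second window averages 44% and does not.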

Memory monitoring parameters and default values

alerts - enable or disable email alerts

enable - switch on/off monitoring and logging of memory usage

check_interval - interval between checks, in seconds

min_available_pct - minimum available memory threshold, as a percentage of the total system memory

min_available_mb - minimum available memory threshold, in MB

[memory_monitor]
alerts = yes
enable = yes
check_interval = 10.0  # s
min_available_pct = 2 # % of all Memory
min_available_mb = 256 # MB
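
The two memory thresholds scale differently with node size. The sketch below converts the percentage threshold to MB for two hypothetical node sizes so it can be compared with the fixed MB threshold. This article does not state how the monitor combines the two values, so the sketch only reports both; the helper name `memory_thresholds_mb` is made up for illustration.

```python
def memory_thresholds_mb(total_mb, min_available_pct=2, min_available_mb=256):
    """Return both default thresholds, expressed in MB, for a given node size."""
    return {
        "from_pct": total_mb * min_available_pct / 100,
        "from_mb": float(min_available_mb),
    }

# Hypothetical node sizes: 8 GiB and 64 GiB of RAM, expressed in MB.
for total_mb in (8192, 65536):
    thresholds = memory_thresholds_mb(total_mb)
    print(f"{total_mb} MB total: {thresholds}")
```

On the smaller node the percentage threshold works out to about 164 MB, below the 256 MB value; on the larger node it is about 1311 MB, well above it.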

Clock monitoring parameters and default values

alerts - enable or disable email alerts

enable - switch on/off clock monitoring

check_interval - interval between checks, in seconds

max_err - maximum permitted time warp (clock jump), in seconds, for each check

[clock_monitor]
alerts = yes
enable = yes
check_interval = 1.0  # s
max_err = 1.0 # s per check
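
One common way to detect a time warp is to compare elapsed wall-clock time with elapsed monotonic time between two checks. The sketch below shows that idea only; it is an assumption for illustration and not necessarily how the Node Monitor measures clock error. The helper `clock_error` is hypothetical, and the 0.1 s sleep stands in for check_interval.

```python
import time

def clock_error(wall_start, wall_end, mono_start, mono_end):
    """Difference between wall-clock and monotonic elapsed time, in seconds."""
    return abs((wall_end - wall_start) - (mono_end - mono_start))

MAX_ERR = 1.0  # default max_err from this article

wall_start, mono_start = time.time(), time.monotonic()
time.sleep(0.1)  # stand-in for check_interval
wall_end, mono_end = time.time(), time.monotonic()

err = clock_error(wall_start, wall_end, mono_start, mono_end)
print("time warp detected" if err > MAX_ERR else "clock ok")
```

If the wall clock is stepped forward or backward between the two readings, the two elapsed times disagree by roughly the size of the step, while the monotonic clock is unaffected.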

Free file system space monitoring parameters and default values

alerts - enable or disable email alerts

enable - switch on/off free file system space monitoring

check_interval - interval between checks, in seconds

min_fs - file systems to monitor, each with its minimum free space threshold in percent

[fs_monitor]
alerts = yes
enable = yes
check_interval = 3600.0  # s
min_fs = {'/': 20, '/u01': 20} # min % of free space

Note that, in the FlashGrid cluster environment, the /u01 file system is usually present on database nodes only. It should not be listed on quorum or storage nodes.
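
Before editing min_fs on a node, it can help to check which of the listed mount points exist there and how much free space they have. The following is a rough sketch using only the Python standard library; it assumes the min_fs value can be read as a Python dict literal, and the helpers `parse_min_fs` and `free_pct` are hypothetical names, not part of flashgrid-diags.

```python
import ast
import os
import shutil

def parse_min_fs(value):
    """Parse a dict-style value such as "{'/': 20, '/u01': 20}"."""
    parsed = ast.literal_eval(value)
    if not isinstance(parsed, dict):
        raise ValueError("min_fs must map mount points to percentages")
    return parsed

def free_pct(path):
    """Percentage of free space on the file system that contains path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total if usage.total else 0.0

min_fs = parse_min_fs("{'/': 20, '/u01': 20}")
for mount, threshold in min_fs.items():
    if not os.path.isdir(mount):
        # For example, /u01 on a quorum or storage node.
        print(f"{mount}: not present on this node")
        continue
    pct = free_pct(mount)
    status = "LOW" if pct < threshold else "ok"
    print(f"{mount}: {pct:.1f}% free (threshold {threshold}%) {status}")
```

A mount point reported as not present is a candidate for removal from min_fs on that node.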

Oracle sessions monitoring (added in version 25.05)

enable - switch on/off Oracle sessions monitoring

alerts - enable or disable email alerts

monitoring - enable or disable writing the current number of Oracle sessions to the log on each check

max_dedicated_servers_per_core - threshold for the number of Oracle processes per CPU core that will trigger the alerts and/or warnings in the log

check_interval - interval between checks, in seconds

alert_interval - interval, in seconds, between alert emails

[oracle_sessions_monitor]
enable = yes
alerts = yes
monitoring = no # enable for debug only
max_dedicated_servers_per_core = 100
check_interval = 20.0  # s
alert_interval = 21600 # s, equal to 6h
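
As a worked example of these defaults, the sketch below computes the total process count that corresponds to the per-core threshold and the number of alert emails per day implied by alert_interval. It assumes the threshold scales linearly with the core count; the 16-core node and the helper names are hypothetical, and how the monitor counts cores is not specified in this article.

```python
def process_limit(cpu_cores, max_dedicated_servers_per_core=100):
    """Total Oracle process count at which the per-core threshold is reached."""
    return cpu_cores * max_dedicated_servers_per_core

def max_alerts_per_day(alert_interval_s=21600):
    """Upper bound on alert emails per day for a given alert_interval."""
    return 86400 // alert_interval_s

print(process_limit(16))      # hypothetical 16-core node
print(max_alerts_per_day())   # default alert_interval of 6 h
```

With the defaults, a 16-core node reaches the threshold at 1600 processes, and a persistent condition would produce at most four alert emails per day.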

Monitoring number of asynchronous I/O requests (added in version 25.05)

enable - switch on/off monitoring of asynchronous I/O requests

check_interval - interval between checks, in seconds

alert_threshold_pct - asynchronous I/O utilization threshold, in percent, that will trigger the alerts

alert_interval - interval, in seconds, between alert emails

[aio_monitor]
enable = yes
check_interval = 10.0  # s
alert_threshold_pct = 50
alert_interval = 21600 # s, equal to 6h
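
On Linux, the kernel exposes the current and maximum number of asynchronous I/O requests in /proc/sys/fs/aio-nr and /proc/sys/fs/aio-max-nr. The sketch below computes utilization from those two counters as a way to inspect the value manually. Whether the Node Monitor uses exactly this ratio, and whether it alerts at or above the threshold, are assumptions made here for illustration; the helper names are hypothetical.

```python
from pathlib import Path

def aio_utilization_pct(aio_nr, aio_max_nr):
    """Asynchronous I/O requests in use as a percentage of the system limit."""
    return 100.0 * aio_nr / aio_max_nr

def read_counter(name):
    """Read an integer counter from /proc/sys/fs (Linux only)."""
    return int(Path("/proc/sys/fs", name).read_text().strip())

ALERT_THRESHOLD_PCT = 50  # default alert_threshold_pct from this article

try:
    pct = aio_utilization_pct(read_counter("aio-nr"), read_counter("aio-max-nr"))
    # Assumption: treat "at or above the threshold" as the alert condition.
    print(f"{pct:.1f}% used:", "ALERT" if pct >= ALERT_THRESHOLD_PCT else "ok")
except (OSError, ValueError, ZeroDivisionError):
    print("AIO counters are not available on this system")
```

If utilization stays high, the limit itself is controlled by the fs.aio-max-nr kernel parameter.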