Active monitoring is a different animal from passive metrics collection. Which is different from log transport.
The Nagios ecosystem was fragmented for the longest time but now it seems most users have drifted towards Icinga, so this is what I use for monitoring. There is some basic integration with Grafana for metrics, so that is for I use for metrics panels. There is good reason to not use your innovation budget on monitoring, instead use simple software that will continue to be around for a long time.
As for what to monitor, that is application specific and should go into the application manifest or configuration management. But generally there should be some sort of active operation that touches the common data path, such as a login, creation of a dummy object such as for example an empty order, validation of said object, and destruction/clean up.
Outside the application there should be checks for whatever the application relies on. Working DNS, NTP drift, Ansible health, certificate validity, applicable APT/RPM packages, database vacuums, log transport health, and the exit status or last file date of scheduled or backgrounded jobs.
Metrics should be collected for total connections, their return status, all types of I/O latency and throughput, and system resources such as CPU, memory, disk space.
The Nagios ecosystem was fragmented for the longest time but now it seems most users have drifted towards Icinga, so this is what I use for monitoring. There is some basic integration with Grafana for metrics, so that is for I use for metrics panels. There is good reason to not use your innovation budget on monitoring, instead use simple software that will continue to be around for a long time.
As for what to monitor, that is application specific and should go into the application manifest or configuration management. But generally there should be some sort of active operation that touches the common data path, such as a login, creation of a dummy object such as for example an empty order, validation of said object, and destruction/clean up.
Outside the application there should be checks for whatever the application relies on. Working DNS, NTP drift, Ansible health, certificate validity, applicable APT/RPM packages, database vacuums, log transport health, and the exit status or last file date of scheduled or backgrounded jobs.
Metrics should be collected for total connections, their return status, all types of I/O latency and throughput, and system resources such as CPU, memory, disk space.