Essentially, at a very generic level (from SOHO up to not-that-critical services at SME scale):
- automated alerts on unusual loads: I do not care about CPU/RAM/disk usage as long as there are no significant spikes, so the monitor just sends alerts (emails) on significant or protracted spikes, with thresholds tuned after a bit of experience. There is no need to collect such data over long periods: you size your infra for the expected load, deploy, and check whether you sized it correctly; after that you only need to recognize the usual patterns and filter them out, keeping alerts for anomalies only;
- log alerts for errors, warnings, access logs, etc. Same principle: you deploy and collect for a while until you know what "the normal logs" look like, then you create alerts for unusual entries. Retention depends on the log types and services you run; some retention periods may be constrained by law.
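To make the first bullet concrete, here is a minimal sketch of "alert only on significant/protracted spikes": keep a rolling window of recent samples and fire only when all of them exceed a tuned threshold, so a single brief spike stays silent. The threshold and window values are illustrative placeholders, not recommendations; a real setup would feed in load averages on a timer and email via smtplib on alert.

```python
# Sketch: alert only on sustained spikes, not single blips.
# Threshold/window values are illustrative, to be tuned with experience.
from collections import deque

def make_spike_detector(threshold: float, window: int):
    """Return a feed(sample) function that is True only when
    `window` consecutive samples all exceed `threshold`."""
    recent = deque(maxlen=window)
    def feed(sample: float) -> bool:
        recent.append(sample)
        return len(recent) == window and all(s > threshold for s in recent)
    return feed

# Example: load-average readings, alert after 3 consecutive samples > 4.0
feed = make_spike_detector(threshold=4.0, window=3)
samples = [1.2, 5.1, 0.9, 4.5, 4.8, 5.2, 1.0]
alerts = [i for i, s in enumerate(samples) if feed(s)]
print(alerts)  # -> [5]: only the third high reading in a row triggers
```

Note that nothing is stored long-term here: the deque holds only the last few samples, matching the "no need to collect such data over significant periods" point above.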
Performance metrics are a totally different matter, one that should be decided more by the devs than by operations, and much of their design depends on the kind of development and services you run. They are much more complex because the monitoring itself affects the performance of the system MUCH more than generic alerting, the occasional ping, and similar service-availability checks. Push and pull get mixed: for alerts, push is the obvious go-to; for availability checks, pull is much sounder; and so on. There is no single "one choice".
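The pull-for-availability point can be sketched in a few lines: the monitor actively connects to the service instead of waiting for it to push anything, so a dead host is detected even when it cannot report. The host/port here are placeholders; a real setup would loop over services on a schedule and mail on failure.

```python
# Sketch of a pull-style availability check: the monitor initiates
# the connection, so a silent/dead service is still detected.
import socket

def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder usage; "example.internal" and send_alert_mail are assumptions:
# if not is_reachable("example.internal", 443):
#     send_alert_mail("service down")
```

A TCP connect is the crudest possible check; depending on the service you may want an application-level probe instead (an HTTP request, a protocol handshake), but the push/pull distinction stays the same.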
Personally I tend to be very calm about starting with fine-grained monitoring: it is important, of course, but it should not become an analysis-paralysis trap, nor waste too many human and IT resources collecting potential garbage in potentially non-marginal batches...