Choose monitoring/alerting/logging stack/software #432
Note from call:
Currently, no unified monitoring or alerting is set up for our resources or services. We should decide on a (100% FOSS) stack that we'd like to use.
This could be nice for managing Prometheus: https://openitcockpit.io/
Forget about this one. Prometheus (of all things) is like one of two features behind an enterprise edition paywall.
Forgot to put https://victoriametrics.com/products/open-source/ here, which we talked about before. It can ingest data from pretty much everything, and seems to be well-suited for small environments.
I looked into this a bit, and I think the simplest, most widely used and supported solution is still Prometheus + Alertmanager. (And optionally, Grafana dashboards, of course.)
Monitoring system resources has first-class support via Node exporter, so alerts for things like low disk space are very easy to add. Mostly a question of config automation, so we don't have to manually edit rules for every existing or new host.
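To make the config automation point a bit more concrete, here is a rough sketch of how it could look. It assumes we keep a plain list of hostnames somewhere, let Prometheus discover targets via `file_sd_configs`, and run Node exporter on its default port 9100. The function names, the 10% threshold, the `15m` hold time and the excluded filesystem types are all placeholders for illustration, not something we agreed on. Since the alert expression matches every scraped instance, only the target file would need regenerating when a host is added.

```python
import json

NODE_EXPORTER_PORT = 9100  # Node exporter's default listen port


def build_file_sd(hosts, port=NODE_EXPORTER_PORT, labels=None):
    """Build a target list in the JSON format read by Prometheus file_sd_configs.

    `hosts` is a hypothetical inventory (an iterable of hostnames);
    duplicates are dropped and the result is sorted for stable diffs.
    """
    entry = {"targets": [f"{host}:{port}" for host in sorted(set(hosts))]}
    if labels:
        entry["labels"] = dict(labels)
    return [entry]


def low_disk_rule(threshold_percent=10, hold="15m"):
    """Return a Prometheus rule file (YAML text) with one low-disk-space alert.

    The rule is not tied to any host: it fires for every instance whose
    Node exporter reports a filesystem below the threshold.
    """
    # Doubled braces are literal braces in an f-string, so the output
    # contains normal PromQL selectors and {{ ... }} alert templates.
    return f"""groups:
  - name: node-disk
    rules:
      - alert: LowDiskSpace
        expr: node_filesystem_avail_bytes{{fstype!~"tmpfs|overlay"}} / node_filesystem_size_bytes{{fstype!~"tmpfs|overlay"}} * 100 < {threshold_percent}
        for: {hold}
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{{{ $labels.instance }}}} ({{{{ $labels.mountpoint }}}})"
"""


if __name__ == "__main__":
    # Example hostnames only; a real run would read these from our inventory.
    print(json.dumps(build_file_sd(["web1.example.org", "db1.example.org"]), indent=2))
    print(low_disk_rule())
```

The JSON output would go into a file referenced from a `file_sd_configs` entry of the scrape job, and the YAML into a file listed under `rule_files`. Prometheus re-reads file_sd files on change, so adding a host should not need a restart.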
For uptime monitoring specifically, we could stop paying for UptimeRobot and run Peekaping from some location/VM that isn't one of our main hosting locations. We can either create an XMPP notification adapter for it, or simply point its webhooks at Hubot Incoming Webhook.