Choose monitoring/alerting/logging stack/software #432


2022-10-18T12:11:00Z

raucao commented

2022-10-18 12:11:00 +00:00

Note from call:

Currently, there is no unified monitoring or alerting set up for our resources or services. We should decide on a (100% FOSS) stack that we'd like to use.


raucao added the idea label 2022-10-18 12:11:00 +00:00

raucao commented

2022-11-24 10:24:17 +00:00

~~This could be nice for managing Prometheus: https://openitcockpit.io/~~

Forget about this one. Prometheus support (of all things) is one of only two features behind an enterprise edition paywall.


raucao commented

2022-11-27 08:52:04 +00:00

Forgot to put https://victoriametrics.com/products/open-source/ here, which we talked about before. It can ingest data from pretty much everything, and seems to be well-suited for small environments.

raucao added the feature label 2025-04-22 14:34:32 +00:00

raucao added this to the 2025 project 2025-04-22 14:34:44 +00:00

raucao moved this to Epics in 2025 on 2025-04-22 15:04:10 +00:00

raucao modified the project from 2025 to 2026

2026-01-15 07:30:59 +00:00

raucao moved this to To Do in 2026 on 2026-01-21 04:09:48 +00:00

raucao commented

2026-05-31 09:20:07 +00:00

I looked into this a bit, and I think the simplest, most widely used and best-supported solution is still [Prometheus](https://prometheus.io/) + [Alertmanager](https://prometheus.io/docs/alerting/latest/alertmanager/). (And optionally, Grafana dashboards, of course.)

Monitoring system resources has first-class support via [Node exporter](https://github.com/prometheus/node_exporter), so alerts for things like low disk space are very easy to add. It's mostly a question of config automation, so we don't have to manually edit rules for every existing or new host.
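As a rough idea of what that automation could look like: Prometheus can discover scrape targets from JSON files via `file_sd_configs`, so a small script could generate that file from a host inventory. This is only a sketch. The hostnames, the `location` label, and the output filename are made up for illustration, and it assumes Node exporter listens on its default port 9100 on every host.

```python
import json
from pathlib import Path

NODE_EXPORTER_PORT = 9100  # Node exporter's default listen port


def build_target_groups(hosts, port=NODE_EXPORTER_PORT):
    """Group hosts by location and return file_sd-style target groups.

    `hosts` maps hostname -> location label (a hypothetical inventory format).
    """
    groups = {}
    for name, location in sorted(hosts.items()):
        groups.setdefault(location, []).append(f"{name}:{port}")
    return [
        {"targets": targets, "labels": {"location": location}}
        for location, targets in sorted(groups.items())
    ]


def write_targets(hosts, path):
    """Write the target groups as JSON for a file_sd_configs entry to read."""
    content = json.dumps(build_target_groups(hosts), indent=2) + "\n"
    Path(path).write_text(content)


if __name__ == "__main__":
    # Hypothetical inventory; in practice this would come from our
    # existing host configuration.
    inventory = {
        "host-a.example.org": "dc1",
        "host-b.example.org": "dc1",
        "host-c.example.org": "dc2",
    }
    write_targets(inventory, "node_targets.json")
```

With targets generated like this, a single low-disk-space alerting rule written against the Node exporter filesystem metrics should apply to every discovered host, so adding a host would only mean adding it to the inventory.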

For uptime monitoring specifically, we could stop paying for UptimeRobot and run [Peekaping](https://peekaping.com/) from some location/VM that isn't one of our main hosting locations. We can either create an XMPP notification adapter for it, or simply send its webhooks to [Hubot Incoming Webhook](https://github.com/67P/hubot-incoming-webhook).



raucao commented

2026-07-03 17:55:30 +00:00