Alertmanager incident response automation with n8n
The Prometheus monitoring stack includes an alert dispatching component called Alertmanager. Many integrations are available to dispatch alerts to notification channels such as pagers or Slack. But how to add automated responses easily and efficiently is the question we’ll try to answer here.
How to install and configure Prometheus and Alertmanager is out of scope for this article; we refer to the official documentation for that. We simply assume that an application is monitored and that alerting rules are in place.
To automate responses to incidents, we will use the automation tool n8n. n8n is a low-code/no-code automation tool with a slick UI and a great developer experience, while still allowing a high degree of customization.
Restart the process under certain conditions
The most common incident response that everyone working in IT has heard at some point or another is “reboot”. In our example, we’ll target something more specific: “restart a process”.
Let’s assume the following landscape: we have an application running on Kubernetes, composed of a database (PostgreSQL, for instance), a frontend (reading from the database), and a backend (writing to PostgreSQL). All components expose metrics, which are scraped by Prometheus.
From time to time, the backend enters a buggy state and stops writing to the database. The frontend keeps working properly but serves outdated or stale data. In this situation, restarting the backend is enough to get back to a working state.
Fortunately, we do have a metric (and hence a derived alert rule) that detects when the backend is in the buggy state, so we get notified in Slack when this scenario happens. But like any lazy engineer, since the mitigation is as simple as restarting the backend, why not automate the process? (Of course, fixing the backend is another good option, but it does not fit well in this story :))
Webhook to trigger a workflow
To automate this process in n8n, the idea is to define a webhook in Alertmanager which will trigger an n8n workflow. The workflow should eventually restart the backend deployment in Kubernetes, and optionally confirm a successful restart in some channel like Slack. Put all that in a diagram and it looks like the following:
The details of the n8n workflow look like the following.
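When the webhook fires, Alertmanager POSTs a JSON document that the n8n Webhook node exposes as the input of the workflow. A trimmed sketch of that payload is shown below; the labels are the ones we will define in the alerting rule further down, and the instance value is purely illustrative:

{
  "version": "4",
  "status": "firing",
  "receiver": "n8n-automation",
  "groupLabels": { "automation": "application-liveness-mitigation" },
  "commonLabels": {
    "alertname": "Application liveness",
    "automation": "application-liveness-mitigation"
  },
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "Application liveness",
        "automation": "application-liveness-mitigation",
        "instance": "backend-0"
      },
      "annotations": {},
      "startsAt": "2024-01-01T00:00:00Z"
    }
  ]
}

The workflow can read these fields (for example the labels of the first entry in alerts) to decide what to restart or what to report in Slack.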
Now the trick is to configure a Prometheus rule that fires when needed, carrying a label Alertmanager can route on. In the snippet below, live is assumed to be a counter the backend increments on each successful write, so increase(live[10m]) <= 0 means nothing has been written during the last 10 minutes. Note the specific label automation:
groups:
  - name: "Application liveness"
    rules:
      - alert: "Application liveness"
        expr: avg(increase(live[10m])) by (instance) <= 0
        for: 30s
        labels:
          automation: application-liveness-mitigation
Alertmanager will have a specific receiver pointing to the n8n webhook, and a route matching the label used in the rule, as shown in the snippet below:
route:
  ...
  routes:
    - receiver: n8n-automation
      group_by: ['automation']
      matchers:
        - automation = "application-liveness-mitigation"
      continue: false
receivers:
  ...
  - name: n8n-automation
    webhook_configs:
      - send_resolved: false
        url: https://...n8n.../webhook/aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
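On the n8n side, one way to perform the restart itself is an HTTP Request node talking to the Kubernetes API, mimicking what kubectl rollout restart does: a PATCH on the Deployment (assumed here to be called backend in the namespace app, both hypothetical names) sent to /apis/apps/v1/namespaces/app/deployments/backend on the API server, with Content-Type application/strategic-merge-patch+json and a service account token allowed to patch deployments. The body simply stamps a restartedAt annotation on the pod template, which triggers a rolling restart:

{
  "spec": {
    "template": {
      "metadata": {
        "annotations": {
          "kubectl.kubernetes.io/restartedAt": "2024-01-01T00:00:00Z"
        }
      }
    }
  }
}

An n8n timestamp expression can be used for the annotation value so that every execution produces a fresh one.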
And that’s it! When the expression avg(increase(live[10m])) by (instance) <= 0 holds for 30 seconds, Alertmanager calls the n8n webhook and the underlying workflow runs, taking whatever meaningful action mitigates the incident.
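As a final sketch, the whole chain can be tested without waiting for a real incident by pushing a synthetic alert carrying the same labels straight to Alertmanager’s v2 API (a POST to /api/v2/alerts on the Alertmanager host; the instance value is again illustrative):

[
  {
    "labels": {
      "alertname": "Application liveness",
      "automation": "application-liveness-mitigation",
      "instance": "backend-0"
    },
    "annotations": {
      "summary": "Backend stopped writing to the database"
    }
  }
]

Alertmanager should route it to the n8n-automation receiver and the workflow should run, restart included, so this is best tried against a non-production deployment first.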