Alertmanager incident response automation with n8n

Benoit Perroud
Jan 9, 2023

The Prometheus monitoring stack includes an alert-dispatching component called Alertmanager. Many integrations are available to dispatch alerts to notification channels such as pagers, Slack, etc. But how can we easily and efficiently automate the response to an alert? That is the question we’ll try to answer here.

Where to dispatch alerts for automated responses?

Installing and configuring Prometheus and Alertmanager is out of scope for this article; we refer to the official documentation for that. We simply assume that an application is monitored and that alerting rules are set.

To automate incident responses, we will use n8n. n8n is a low-code/no-code automation tool with a slick UI and a pleasant developer experience that still allows a high degree of customization.

N8n is a good candidate to automate incident responses

Restart the process under certain conditions

The most common incident response, which everyone working in IT has heard at some point, is “reboot”. In our example, we’ll target something more specific: “restart a process”.

Let’s assume the following landscape: we have an application running on Kubernetes. The application is composed of a database (PostgreSQL, for instance), a frontend (reading from the database) and a backend (writing to the database). All components expose metrics, which are scraped by Prometheus.

From time to time, the backend enters a buggy state and stops writing to the database. The frontend continues working properly but serves stale data. In this situation, restarting the backend is enough to get back to a working state.

Fortunately, we have a metric (and hence a derived alert rule) that detects when the backend is in the buggy state, so we get notified in Slack when this scenario happens. But as any lazy engineer would ask: since mitigating the problem is as simple as restarting the backend, why not automate it? (Of course, fixing the backend is another good option, but it does not fit well in this story :))

Webhook to trigger a workflow

To automate this process in n8n, the idea is to define a webhook receiver in Alertmanager that triggers an n8n workflow. The workflow should eventually restart the backend deployment in Kubernetes, and optionally confirm a successful restart in some channel like Slack. Put in a diagram, it looks like the following:

Alertmanager sending a webhook notification to n8n

The details of the n8n workflow look like the following.

n8n workflow triggered by alertmanager, calling kubectl rollout restart
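
The real workflow is assembled from nodes in the n8n UI, but the decision it has to make can be sketched in a few lines of Python. This is only an illustration, assuming Alertmanager's standard webhook JSON payload (a list of alerts, each with a status and labels). The function name plan_restart, the deployment name backend and the namespace default are hypothetical.

```python
from typing import Optional

# Label value set on the alert rule and used for routing in Alertmanager.
AUTOMATION_LABEL = "application-liveness-mitigation"


def plan_restart(payload: dict,
                 deployment: str = "backend",   # hypothetical deployment name
                 namespace: str = "default"     # hypothetical namespace
                 ) -> Optional[list]:
    """Return the kubectl command to run, or None if no restart is needed."""
    firing = [
        alert for alert in payload.get("alerts", [])
        if alert.get("status") == "firing"
        and alert.get("labels", {}).get("automation") == AUTOMATION_LABEL
    ]
    if not firing:
        return None
    return ["kubectl", "rollout", "restart",
            f"deployment/{deployment}", "-n", namespace]


# Trimmed-down example of what Alertmanager posts to the webhook.
SAMPLE_PAYLOAD = {
    "status": "firing",
    "receiver": "n8n-automation",
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "Application liveness",
                "automation": AUTOMATION_LABEL,
            },
        }
    ],
}

command = plan_restart(SAMPLE_PAYLOAD)
```

Returning the command as a list of arguments rather than a single string avoids shell quoting issues if it is later handed to something like subprocess.run.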

Now the trick is to configure a Prometheus rule that fires when needed and carries a label that Alertmanager can route on. Note the specific label automation in the snippet below:

groups:
  - name: "Application liveness"
    rules:
      - alert: "Application liveness"
        expr: avg(increase(live[10m])) by (instance) <= 0
        for: 30s
        labels:
          automation: application-liveness-mitigation
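
To make the condition in expr more concrete, here is a simplified Python model of what it checks. It is a sketch under assumptions, not how Prometheus computes it: we assume live is a counter, we ignore the extrapolation Prometheus applies at the window boundaries (it scales the result but does not change whether it is zero), and we do not model the for: 30s delay. The function names are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Optional


def simple_increase(samples: List[float]) -> Optional[float]:
    """Counter increase over one window, treating any drop as a counter reset."""
    if len(samples) < 2:
        return None  # Prometheus yields no value with fewer than two samples
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so count the new value.
        total += curr - prev if curr >= prev else curr
    return total


def stalled_instances(windows: Dict[str, List[List[float]]]) -> List[str]:
    """windows maps an instance to the 10-minute sample windows of its series."""
    stalled = []
    for instance, series in windows.items():
        increases = [inc for inc in map(simple_increase, series) if inc is not None]
        # Mirrors: avg(increase(live[10m])) by (instance) <= 0
        if increases and mean(increases) <= 0:
            stalled.append(instance)
    return sorted(stalled)


# Hypothetical instances: one flat counter, one growing, one with too few samples.
windows = {
    "backend-0": [[5, 5, 5, 5]],
    "backend-1": [[5, 6, 8]],
    "backend-2": [[7]],
}
result = stalled_instances(windows)
```

In this model only the instance whose counter stayed flat over the whole window is reported, which is the "backend stopped writing" situation the alert targets.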

Alertmanager needs a specific receiver pointing to the n8n webhook, and a route matching the label used in the rule, as shown in the snippet below:

route:
  ...
  routes:
    - receiver: n8n-automation
      group_by: ['automation']
      matchers:
        - automation = "application-liveness-mitigation"
      continue: false

receivers:
  ...
  - name: n8n-automation
    webhook_configs:
      - send_resolved: false
        url: https://...n8n.../webhook/aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
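
The effect of the matchers and continue settings can be illustrated with a small Python model of the routing decision. This is a deliberately reduced sketch: it only handles one level of child routes and equality matchers, and the names Route, receivers_for and default-receiver are hypothetical (the top-level receiver is elided in the configuration above).

```python
from typing import Dict, List, NamedTuple


class Route(NamedTuple):
    receiver: str
    matchers: Dict[str, str]   # label name -> required value (equality only)
    continue_: bool = False    # mirrors the "continue" setting of a route


def receivers_for(labels: Dict[str, str],
                  child_routes: List[Route],
                  default_receiver: str) -> List[str]:
    """Return the receivers an alert with these labels would be sent to."""
    selected = []
    for route in child_routes:
        # A missing label is treated as an empty string, as Alertmanager does.
        if all(labels.get(name, "") == value
               for name, value in route.matchers.items()):
            selected.append(route.receiver)
            if not route.continue_:
                break  # continue: false stops at the first matching route
    # If no child route matched, the parent route's receiver handles the alert.
    return selected or [default_receiver]


routes = [
    Route("n8n-automation", {"automation": "application-liveness-mitigation"}),
]

liveness = receivers_for(
    {"alertname": "Application liveness",
     "automation": "application-liveness-mitigation"},
    routes,
    "default-receiver",  # hypothetical top-level receiver
)
other = receivers_for({"alertname": "Something else"}, routes, "default-receiver")
```

With continue: false, an alert carrying the automation label is delivered to the n8n receiver only and is not evaluated against any sibling routes that follow.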

And that’s it! When the expression avg(increase(live[10m])) by (instance) <= 0 is true for 30 seconds, the n8n webhook is called and the underlying workflow is executed, taking any meaningful action to mitigate the incident.

Benoit Perroud

I’m building and running Kubernetes, Kafka, ElasticSearch, Hadoop & Friends @sqoobaio