False positive alerts sent for applications in EU

Incident Report for Mendix Technology

Postmortem

During a recent upgrade of a component in our Cloud V4, we sent out a large number of false positives for the Application Status or runtime heartbeat alert.

We use a blue-green process for releasing kapacitor (the alerting engine that generates alerts and sends them to our alerting database). In a blue-green process, an instance of the application running the new version is spun up next to the currently active instance. Requests are then routed to the new instance, after which the old instance is shut down.
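The sequence described above can be sketched roughly as follows. This is an illustrative model only, not our actual release tooling: `Instance`, `Router`, and `blue_green_release` are hypothetical names, and the final check that the old instance has really stopped is an assumption about what a safe cutover would verify.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical stand-in for one running copy of a service."""
    version: str
    running: bool = False

    def start(self) -> None:
        self.running = True

    def stop(self) -> None:
        self.running = False


class Router:
    """Hypothetical router that sends all requests to one active instance."""

    def __init__(self, active: Instance) -> None:
        self.active = active

    def switch_to(self, target: Instance) -> None:
        if not target.running:
            raise RuntimeError("cannot route requests to a stopped instance")
        self.active = target


def blue_green_release(router: Router, new_version: str) -> Instance:
    """Run one blue-green release and return the new active instance."""
    old = router.active

    # 1. Spin up the new version next to the currently active one.
    new = Instance(new_version)
    new.start()

    # 2. Route requests to the new instance.
    router.switch_to(new)

    # 3. Shut down the old instance.
    old.stop()

    # Assumed safety check: fail loudly if the old instance is still alive,
    # because two live instances would both keep doing work.
    if old.running:
        raise RuntimeError(f"old instance {old.version} did not shut down")

    return new
```

In this model, step 3 is the one that matters for a service that produces output on its own rather than only answering routed requests: switching the router does nothing to silence an old instance that is still running.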

To decrease the heartbeat timeout, we ran ~20 releases. We ran this many because we were decreasing the timeout value in small steps while monitoring the alerts sent out to our customers. In one of these releases the blue-green process failed: the new alerting engine was spun up, but the old version of the alerting engine was not shut down. As a consequence, spurious alerts were sent to our alerting database, causing load problems on that database. This did not show up until later in the day, when we had already finished the actual release. We then scaled up our alerting database.
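To illustrate why a leftover engine increases the load on the alerting database, here is a minimal sketch. It is a toy model, not kapacitor's actual behavior or API: `AlertDatabase`, `AlertingEngine`, and `run_cycle` are hypothetical names, and the heartbeat ages and timeout are made-up values.

```python
from typing import Dict, List, Tuple


class AlertDatabase:
    """Hypothetical alerting database that just records every write."""

    def __init__(self) -> None:
        self.rows: List[Tuple[str, str]] = []

    def write(self, engine: str, app: str) -> None:
        self.rows.append((engine, app))


class AlertingEngine:
    """Hypothetical engine that checks heartbeat ages against a timeout."""

    def __init__(self, name: str, db: AlertDatabase) -> None:
        self.name = name
        self.db = db

    def evaluate(self, heartbeat_age: Dict[str, float], timeout: float) -> None:
        # Each engine evaluates the rule independently and writes its own
        # alert for every application whose heartbeat is older than the timeout.
        for app, age in sorted(heartbeat_age.items()):
            if age > timeout:
                self.db.write(self.name, app)


def run_cycle(
    engines: List[AlertingEngine],
    heartbeat_age: Dict[str, float],
    timeout: float,
) -> None:
    """Run one evaluation cycle on every engine that is still alive."""
    for engine in engines:
        engine.evaluate(heartbeat_age, timeout)
```

With the made-up input `{"app-a": 30.0, "app-b": 90.0, "app-c": 120.0}` and a timeout of `60.0`, one engine writes two rows per cycle, while an old and a new engine running side by side write four, since each evaluates the same rule against the same data.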

Although we communicated to our customers that they could expect false positives, the problems we experienced during the release were unrelated to the false positives we expected.

The corrective measures we will take:

We will no longer use a blue-green process for our alerting services. It is not really required and it carries additional risks.
We have scaled up our alerting database so that it can handle higher processing loads.

Posted Aug 15, 2018 - 08:42 UTC

Resolved

This incident has been resolved.

Posted Aug 09, 2018 - 07:56 UTC

Update

The heavy load was caused by today's maintenance. After our blue-green upgrade, the previous version of our alerting server did not shut down properly. This meant that two alerting components were putting alerts into our alerting database, which was unable to sustain this double load for an extended period.

Posted Aug 08, 2018 - 19:14 UTC

Monitoring

Due to heavy load, the alerting system started to fail. We have scaled up the related component and are monitoring the recovery.

Posted Aug 08, 2018 - 19:07 UTC

Investigating

We're investigating why incorrect alerts were sent out to subscribers of application alerts.

Posted Aug 08, 2018 - 18:12 UTC

This incident affected: Mendix Cloud (Mendix Cloud EU (Frankfurt)).