During a recent upgrade of a component in our Cloud V4, we sent out a large number of false positives on the Application Status or runtime heartbeat alert.
We use a blue-green process for releasing kapacitor (the alerting engine that generates alerts and sends them to our alerting database). In a blue-green process, an instance of the application running the new version is spun up next to the currently active instance. Requests are then routed to the new instance, after which the old instance is shut down.
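The three steps of a blue-green release can be sketched as follows. This is a minimal illustration under stated assumptions, not our actual release tooling: the `Instance` and `Router` classes and the `blue_green_release` function are hypothetical stand-ins for the real deployment platform.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instance:
    """Hypothetical stand-in for one deployed application instance."""
    version: str
    running: bool = False

    def start(self) -> None:
        self.running = True

    def stop(self) -> None:
        self.running = False


@dataclass
class Router:
    """Hypothetical stand-in for whatever routes requests to the active instance."""
    active: Optional[Instance] = None


def blue_green_release(router: Router, old: Instance, new_version: str) -> Instance:
    """Replace `old` with a new instance running `new_version`."""
    new = Instance(version=new_version)
    new.start()          # 1. spin up the new version next to the old one
    router.active = new  # 2. route requests to the new instance
    old.stop()           # 3. shut down the old instance
    return new
```

If step 3 is skipped or fails silently, both instances keep running even though only the new one receives routed requests.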
To decrease the heartbeat timeout, we ran about 20 releases. We needed this many because we lowered the timeout value in small steps while monitoring the alerts sent out to our customers. In one of these releases the blue-green process failed: the new alerting engine was spun up, but the old version of the alerting engine was not shut down. As a consequence, spurious alerts were sent to our alerting database, which caused load problems on that database. This did not show up until later in the day, when we had already finished the actual release. In response, we scaled up our alerting database.
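The failure mode described here, an old instance left running next to the new one, is the kind of condition a post-release check could catch. The sketch below is only an illustration: `check_single_active` is a hypothetical helper, and it assumes the list of running alerting engine versions can be obtained from the platform in some way not shown here.

```python
from typing import Sequence


def check_single_active(running_versions: Sequence[str]) -> None:
    """Raise if anything other than exactly one alerting engine is running.

    `running_versions` is assumed to be supplied by the caller, for example
    from the platform's list of running application instances.
    """
    if len(running_versions) != 1:
        raise RuntimeError(
            "expected exactly 1 running alerting engine, found "
            f"{len(running_versions)}: {list(running_versions)}"
        )
```

Running such a check after every release in a series, rather than only at the end, would surface a leftover instance before it has been sending alerts for hours.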
Although we had told our customers that they could expect false positives, the problem we experienced during the release was unrelated to the false positives we expected.
The corrective measures we will take: