Mendix Cloud v3 EU - Physical server outage

Incident Report for Mendix Technology

Postmortem

After closely analysing the chain of events and the components involved, we have come to the following conclusions.

  • A disk failure in the shared data storage system was handled in an unexpected manner by the storage system, which destabilised the network connections.
  • After all other possible causes had been eliminated, network stability was regained by resetting individual network connections and, finally, one of the core network switches.
  • Mendix is working with its suppliers to resolve any remaining underlying issues and prevent this incident from recurring.

Please contact customer support should you require more information.

Posted Oct 25, 2019 - 15:18 UTC

Resolved

This incident has been resolved.
Posted Oct 24, 2019 - 15:42 UTC

Monitoring

The reboot of the networking infrastructure has alleviated the stability issue, and the environment is performing as expected. We are monitoring the situation and will update the status when necessary.
Posted Oct 23, 2019 - 18:33 UTC

Update

The engineers are still working on restarting the networking infrastructure. We will post the next update in an hour.
Posted Oct 23, 2019 - 17:19 UTC

Update

The next step in the process of resolving this issue will be to restart all other components in the networking infrastructure. We will post the next update in an hour.
Posted Oct 23, 2019 - 16:16 UTC

Update

In cooperation with the supplier, engineers are currently working to clear, reset and restore the connections between the storage subsystem and the servers. We are analysing the results of this action and will post the next update in an hour.
Posted Oct 23, 2019 - 14:53 UTC

Investigating

The engineers working on the case have resolved a likely root cause, but this did not resolve the stability issue that is causing the partial failure. We are continuing the investigation and will post the next update in 1 hour.
Posted Oct 23, 2019 - 13:27 UTC

Identified

The engineers working on this issue have isolated a likely root cause and are working towards a recovery scenario. We currently expect to see the results of this at around 15:00 CEST. We will post an update at or before that time.
Posted Oct 23, 2019 - 11:43 UTC

Update

Unfortunately, we have no news to report. We are continuing to investigate this issue with the supplier and will post another update within 1 hour.
Posted Oct 23, 2019 - 10:38 UTC

Update

A number of applications running on Mx Cloud V3 are experiencing slowdowns due to the partial storage failure.

We are continuing to investigate this issue with the supplier and will post another update within 1 hour.
Posted Oct 23, 2019 - 09:30 UTC

Investigating

We are experiencing a partial failure on the storage subsystem in Mendix Cloud v3 EU. We're currently investigating the issue and will post an update within 1 hour of this posting.
Posted Oct 23, 2019 - 08:28 UTC