Unexpected Outage (resolved)
We are currently offline due to an issue within our application hosting infrastructure. We have identified the cause of the issue and are working to resolve this.
We will provide any updates, including expected time for the service to be restored, as soon as we have more information.
UPDATE (Wednesday 14th Nov, 9:57pm GMT) – The service is back online, and we will provide a further update once our internal debrief is complete.
UPDATE (Wednesday 14th Nov, 10:38pm GMT) – A quick update with some further details on this outage. The initial cause was an issue within our application database server cluster. One of the active servers failed, but it was still reporting as operating normally, which prevented failover to the redundant node. We will continue to investigate why the resiliency did not work as expected.
UPDATE (Thursday 15th Nov, 9:30am GMT) – We experienced further instability as a result of a recurrence of this issue. The system is stable and operating as expected again. Our operations team are closely monitoring the situation.
We apologise for the inconvenience caused to any customers by this outage. We continue to make a very significant investment in our hosting platform and resulting service availability level, which we have maintained at 99.99% since we launched Xero in 2007. It frustrates us if we have any unscheduled downtime, but we work hard to learn whatever we can from these incidents and use this to further improve and strengthen our platform going forward.
UPDATE (Thursday 15th Nov, 10:14pm GMT) – Yesterday we had issues with the stability of our platform, resulting in four customer-impacting outages totalling 47 minutes.
All of the outages related to our database layer, which is critical for the operation of the application. We have a resilient database environment with hot standby servers and automatic failover; however, yesterday this failover did not work as it has in the past or during regular testing.
Of the four events, two were the result of a server becoming unresponsive without the expected failover, and two were the result of excessive database load that did not require a failover.
Our investigations have focused on both why the database server became unresponsive and the reason that the resilience did not provide seamless failover.
- The database server became unresponsive due to unusual batch process load, which caused contention between the memory demands of the database and those of the operating system. We have identified the processes at fault and have made a first change to reduce the likelihood of this recurring. We have additional work underway to implement a permanent fix that prevents this issue from occurring.
- Our database layer relies on Windows Clustering for resilience; however, it did not automatically fail over as it should have. We have identified the reason this did not work as expected and verified the issue with the supplier.
We are in the later stages of a project to migrate away from Windows Clustering as part of a wider project to improve the resilience of our platform. We expect to make this change early next year.
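To illustrate the failure mode described above, where a server appears to be running but cannot serve requests, the sketch below shows one general way a health check can test responsiveness rather than mere liveness. It is a generic Python illustration, not a description of how Windows Clustering or our platform works. The names is_responsive and should_fail_over, the timeout and the attempt count are all hypothetical, and the probe stands in for whatever lightweight request a real check would send to the database.

```python
import concurrent.futures
from typing import Callable


def is_responsive(probe: Callable[[], object], timeout_s: float) -> bool:
    """Return True only if probe() completes without error within timeout_s.

    `probe` is a hypothetical stand-in for a lightweight request to the
    server (for example, a trivial query). A server whose process is
    alive but hung fails this check because the probe times out.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(probe)
    try:
        future.result(timeout=timeout_s)
        return True
    except Exception:
        # Covers both a timeout and an error raised by the probe itself.
        return False
    finally:
        # Do not block on a hung probe; let the caller move on.
        pool.shutdown(wait=False)


def should_fail_over(
    probe: Callable[[], object], timeout_s: float = 1.0, attempts: int = 3
) -> bool:
    """Recommend failover only if every one of `attempts` probes fails.

    Requiring several consecutive failures avoids failing over on a
    single slow response; any() stops at the first successful probe.
    """
    return not any(is_responsive(probe, timeout_s) for _ in range(attempts))
```

The design choice here is that health is judged by whether real work completes in bounded time, so a node that merely reports itself as running would still be flagged as a failover candidate.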
Our operations team continuously analyse the platform and look for ways to improve reliability and performance. We treat any issues seriously, and the team are very aware of the impact that system issues have on our customers. While we maintain very high uptime, we will continue to work to eliminate risk wherever possible.