Post-Mortem Report – Timetracker Cloud Outage
Incident date: 2025-11-16
Service restored: 2025-11-17
Duration: ~14 hours (2025-11-16 ~18:00 CET → 2025-11-17 08:00 CET)
Timetracker Cloud became unavailable on 16 November 2025 at approximately 18:00 CET. The service remained inaccessible until 08:00 CET on 17 November, when full functionality was restored. This document summarises the sequence of events, the technical causes, and the measures being taken to prevent recurrence.
At around 18:00 CET, our Atlassian Cloud integration began experiencing intermittent network instability. During these periods, outbound communication from our application instance slowed significantly and, at times, stalled entirely. As requests accumulated, the application server came under sustained load, eventually becoming unresponsive.
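To illustrate the mechanics described above, the sketch below shows how an outbound call with no timeout ties up a worker thread for as long as the remote side stalls, while a bounded timeout returns the thread to the pool before a backlog can build. It is a minimal, illustrative example only: the endpoint URL, the use of Python with the requests library, and the pool size are assumptions made for the sketch, not a description of our production stack.

```python
import requests
from concurrent.futures import ThreadPoolExecutor

# Hypothetical endpoint standing in for the Atlassian Cloud REST API.
ATLASSIAN_API = "https://example.atlassian.net/rest/api/3/myself"

def call_without_timeout(session: requests.Session) -> int:
    # No timeout: a stalled connection blocks this worker thread indefinitely.
    # Enough of these and the whole pool is exhausted, which is the pattern
    # that made the application server unresponsive during the incident.
    return session.get(ATLASSIAN_API).status_code

def call_with_timeout(session: requests.Session) -> int:
    # Bounded connect/read timeouts return the thread to the pool even when
    # the remote side stops responding, so the backlog cannot grow unchecked.
    return session.get(ATLASSIAN_API, timeout=(3.05, 10)).status_code

if __name__ == "__main__":
    session = requests.Session()
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(call_with_timeout, session) for _ in range(8)]
        for future in futures:
            try:
                print(future.result())
            except requests.RequestException as exc:
                print(f"request failed or timed out: {exc}")
```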
An automated recovery mechanism attempted to restart the application server. The restart itself completed, but it did not bring the application layer back online. As a result, the server appeared operational at the infrastructure level while the application remained unavailable.
Because the restarted server reported as healthy at the infrastructure level, our monitoring system registered only the shutdown and subsequent restart and did not identify that the application layer was still offline. This gap allowed the outage to persist until manual intervention the following morning.
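The distinction at the heart of this gap is between an infrastructure-level (liveness) check and an application-level (readiness) check. The sketch below shows one common way to express it; the /health/ready endpoint, the service name and the Python tooling are hypothetical, chosen only to make the idea concrete.

```python
import subprocess
import requests

# Hypothetical readiness endpoint that only responds once the application
# layer has started and can reach its dependencies.
READINESS_URL = "http://localhost:8080/health/ready"

def process_is_running(service_name: str = "timetracker") -> bool:
    # Infrastructure-level check: is the service process up at all?
    # This is roughly the signal our monitoring acted on during the incident.
    result = subprocess.run(["pgrep", "-f", service_name], capture_output=True)
    return result.returncode == 0

def application_is_ready(timeout_s: float = 5.0) -> bool:
    # Application-level check: does the application itself answer and report
    # that it is ready to serve requests?
    try:
        return requests.get(READINESS_URL, timeout=timeout_s).status_code == 200
    except requests.RequestException:
        return False

def restart_succeeded() -> bool:
    # A recovery attempt should only count as successful when both checks
    # pass; on the night of the incident only the first condition held.
    return process_is_running() and application_is_ready()
```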
Service was restored at approximately 08:00 CET on 17 November.
The outage was caused by a combination of factors:
* Intermittent network instability affecting outbound communication to Atlassian Cloud, which allowed requests to accumulate until the application server became unresponsive.
* An automated restart that brought the server back at the infrastructure level but did not restore the application layer.
* A monitoring gap that registered the restart as successful and did not flag that the application remained offline.
Once the issue was identified on the morning of 17 November, engineers performed a full application-level restart and validated all dependent service connections. Normal operation resumed immediately afterwards.
We are implementing several improvements to reduce the likelihood and impact of similar incidents:
* Introducing additional buffering and more resilient handling of backlogged request queues when external network instability occurs.
* Refining the health-check logic so automated recovery attempts verify not only server availability but also full application-layer readiness.
* Expanding our monitoring to distinguish infrastructure-level restarts from incomplete application-level recoveries.
* Adding targeted alerts when request queues exhibit abnormal growth patterns; a sketch of one possible detection approach follows this list.
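As an illustration of the queue-growth alerting mentioned above, the sketch below compares each new queue-depth sample against a rolling baseline and flags sudden jumps. The window size, growth factor and minimum depth are placeholder values chosen for the sketch, not the thresholds we will deploy.

```python
from collections import deque
from statistics import mean

class QueueGrowthAlert:
    """Flags abnormal growth in a request-queue depth metric.

    Deliberately simple: compare the latest sample against a rolling
    baseline and alert when it exceeds that baseline by a fixed factor.
    """

    def __init__(self, window: int = 30, growth_factor: float = 3.0, min_depth: int = 50):
        self.samples = deque(maxlen=window)   # recent queue-depth samples
        self.growth_factor = growth_factor    # how far above baseline is abnormal
        self.min_depth = min_depth            # ignore noise at low depths

    def observe(self, queue_depth: int) -> bool:
        """Record a sample; return True if an alert should fire."""
        baseline = mean(self.samples) if self.samples else 0.0
        self.samples.append(queue_depth)
        if queue_depth < self.min_depth:
            return False
        return baseline > 0 and queue_depth > self.growth_factor * baseline

# Example: a sudden jump well above the recent baseline triggers the alert.
if __name__ == "__main__":
    detector = QueueGrowthAlert()
    for depth in [10, 12, 11, 14, 13, 400]:
        if detector.observe(depth):
            print(f"ALERT: request queue depth {depth} is abnormal")
```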
These steps are underway and will be completed as part of our reliability improvement programme.
We sincerely apologise for this disruption and appreciate your patience while we work to ensure stronger resilience going forward. If you have further questions or require additional details, we are ready to assist.