Post-Mortem Report – Timetracker Cloud Outage
Incident date: 2025-11-16
Service restored: 2025-11-17
Duration: ~14 hours (2025-11-16 ~18:00 CET → 2025-11-17 08:00 CET)
Timetracker Cloud became unavailable on 16 November 2025 at approximately 18:00 CET. The service remained inaccessible until 08:00 CET on 17 November, when full functionality was restored. This document summarises the sequence of events, the technical causes, and the measures being taken to prevent recurrence.
At around 18:00 CET, our Atlassian Cloud integration began experiencing intermittent network instability. During these periods, outbound communication from our application instance slowed significantly and, at times, stalled entirely. As requests accumulated, the application server came under sustained load, eventually becoming unresponsive.
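To illustrate the mechanics described above, the sketch below shows how an outbound call with no timeout ties up a worker thread for as long as the remote side stalls, while a bounded timeout returns the thread to the pool before a backlog can build. It is a minimal, illustrative example only: the endpoint URL, the use of Python with the requests library, and the pool size are assumptions made for the sketch, not a description of our production stack.

```python
import requests
from concurrent.futures import ThreadPoolExecutor

# Hypothetical endpoint standing in for the Atlassian Cloud REST API.
ATLASSIAN_API = "https://example.atlassian.net/rest/api/3/myself"

def call_without_timeout(session: requests.Session) -> int:
    # No timeout: a stalled connection blocks this worker thread indefinitely.
    # Enough of these and the whole pool is exhausted, which is the pattern
    # that made the application server unresponsive during the incident.
    return session.get(ATLASSIAN_API).status_code

def call_with_timeout(session: requests.Session) -> int:
    # Bounded connect/read timeouts return the thread to the pool even when
    # the remote side stops responding, so the backlog cannot grow unchecked.
    return session.get(ATLASSIAN_API, timeout=(3.05, 10)).status_code

if __name__ == "__main__":
    session = requests.Session()
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(call_with_timeout, session) for _ in range(8)]
        for future in futures:
            try:
                print(future.result())
            except requests.RequestException as exc:
                print(f"request failed or timed out: {exc}")
```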
An automated recovery mechanism attempted to restart the application server. The restart itself completed, but it did not bring the application layer back online. As a result, the server appeared operational at the infrastructure level while the application remained unavailable.
Because the restarted server reported as healthy at the infrastructure level, our monitoring system registered only the shutdown and subsequent restart and did not identify that the application layer was still offline. This gap allowed the outage to persist until manual intervention the following morning.
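The distinction at the heart of this gap is between an infrastructure-level (liveness) check and an application-level (readiness) check. The sketch below shows one common way to express it; the /health/ready endpoint, the service name and the Python tooling are hypothetical, chosen only to make the idea concrete.

```python
import subprocess
import requests

# Hypothetical readiness endpoint that only responds once the application
# layer has started and can reach its dependencies.
READINESS_URL = "http://localhost:8080/health/ready"

def process_is_running(service_name: str = "timetracker") -> bool:
    # Infrastructure-level check: is the service process up at all?
    # This is roughly the signal our monitoring acted on during the incident.
    result = subprocess.run(["pgrep", "-f", service_name], capture_output=True)
    return result.returncode == 0

def application_is_ready(timeout_s: float = 5.0) -> bool:
    # Application-level check: does the application itself answer and report
    # that it is ready to serve requests?
    try:
        return requests.get(READINESS_URL, timeout=timeout_s).status_code == 200
    except requests.RequestException:
        return False

def restart_succeeded() -> bool:
    # A recovery attempt should only count as successful when both checks
    # pass; on the night of the incident only the first condition held.
    return process_is_running() and application_is_ready()
```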
Service was restored at approximately 08:00 CET on 17 November.
The outage was caused by a combination of factors:
* Intermittent network instability affecting outbound communication to Atlassian Cloud, which allowed requests to accumulate until the application server became unresponsive.
* An automated restart that brought the server back at the infrastructure level but did not restore the application layer.
* A monitoring gap that registered the restart as successful and did not flag that the application remained offline.
Once the issue was identified on the morning of 17 November, engineers performed a full application-level restart and validated all dependent service connections. Normal operation resumed immediately afterwards.
We are implementing several improvements to reduce the likelihood and impact of similar incidents:
* Introducing additional buffering and more resilient handling of backlogged request queues when external network instability occurs.
* Refining the health-check logic so automated recovery attempts verify not only server availability but also full application-layer readiness.
* Expanding our monitoring to distinguish infrastructure-level restarts from incomplete application-level recoveries.
* Adding targeted alerts when request queues exhibit abnormal growth patterns; a sketch of one possible detection approach follows this list.
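As an illustration of the queue-growth alerting mentioned above, the sketch below compares each new queue-depth sample against a rolling baseline and flags sudden jumps. The window size, growth factor and minimum depth are placeholder values chosen for the sketch, not the thresholds we will deploy.

```python
from collections import deque
from statistics import mean

class QueueGrowthAlert:
    """Flags abnormal growth in a request-queue depth metric.

    Deliberately simple: compare the latest sample against a rolling
    baseline and alert when it exceeds that baseline by a fixed factor.
    """

    def __init__(self, window: int = 30, growth_factor: float = 3.0, min_depth: int = 50):
        self.samples = deque(maxlen=window)   # recent queue-depth samples
        self.growth_factor = growth_factor    # how far above baseline is abnormal
        self.min_depth = min_depth            # ignore noise at low depths

    def observe(self, queue_depth: int) -> bool:
        """Record a sample; return True if an alert should fire."""
        baseline = mean(self.samples) if self.samples else 0.0
        self.samples.append(queue_depth)
        if queue_depth < self.min_depth:
            return False
        return baseline > 0 and queue_depth > self.growth_factor * baseline

# Example: a sudden jump well above the recent baseline triggers the alert.
if __name__ == "__main__":
    detector = QueueGrowthAlert()
    for depth in [10, 12, 11, 14, 13, 400]:
        if detector.observe(depth):
            print(f"ALERT: request queue depth {depth} is abnormal")
```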
These steps are underway and will be completed as part of our reliability improvement programme.
We sincerely apologise for this disruption and appreciate your patience while we work to ensure stronger resilience going forward. If you have further questions or require additional details, we are ready to assist.