On October 9, 2025, we experienced an incident in which our customer portal displayed stale data, followed by disruptions to charging operations. We want to provide a transparent overview of what happened, how we responded, and the steps we are taking to prevent a recurrence.
## Timeline of Events
- 13:45: Replication issues detected between database instances. The replication lag was growing slowly but steadily and initially did not attract attention, as it looked like a transient issue.
- 14:27: Our engineering team began recycling the most heavily loaded backend systems to clear potentially hanging database connections. No improvement observed.
- 14:40: High-traffic API endpoints temporarily disabled to reduce load. No improvement observed.
- 15:00: Web services temporarily shut down. The underlying problem persisted despite the reduced load.
- 15:00: Charging control plane restarted. Load decreased, but the event backlog continued to grow.
- 15:10: Web services restored. Event backlog remained unchanged.
- 15:37: Portal services shut down. Event backlog remained unchanged.
- 16:03: Incident escalated to emergency status.
- 16:15: Online detection services temporarily shut down. Event backlog cleared and returned to normal levels.
- 16:30: All systems stabilized and resumed normal operations.
## Root Cause Analysis
Our cloud database service stopped shipping transaction logs, causing the secondary database replica to fall behind. This occurred despite adequate resources being available to handle the incoming transaction volume, and the growing lag is clearly visible in the corresponding server metrics.
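For readers unfamiliar with how this kind of lag shows up in metrics, here is a minimal monitoring sketch. It assumes a PostgreSQL-style streaming replication setup for illustration only; the report does not name the database engine, and the connection string and helper name `get_replication_lag_seconds` are hypothetical.

```python
# Minimal sketch: measure how far a replica's replay lags behind the primary.
# Assumes PostgreSQL-style streaming replication; DSN and names are hypothetical.
import psycopg2

REPLICA_DSN = "host=replica.example.internal dbname=portal user=monitor"  # hypothetical

def get_replication_lag_seconds(dsn: str) -> float:
    """Return the approximate replication lag of the replica, in seconds."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            # pg_last_xact_replay_timestamp() is the commit time of the last
            # transaction replayed on the standby; its distance from now()
            # approximates the replication lag.
            cur.execute(
                "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))"
            )
            (lag,) = cur.fetchone()
            return float(lag) if lag is not None else 0.0

if __name__ == "__main__":
    print(f"replication lag: {get_replication_lag_seconds(REPLICA_DSN):.1f}s")
```

A value that keeps climbing, as it did during this incident, indicates that transaction logs are not being shipped and replayed, rather than a brief spike from a heavy write burst.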
We are still investigating the underlying cause of the log shipping failure. Current areas of investigation include:
- Cloud infrastructure networking issues
- Long-running database transactions caused by application behavior
We are actively working to reproduce the issue to better understand the failure conditions.
## Preventative Measures and Follow-up
To prevent similar incidents and improve our response capabilities, we are implementing the following measures:
- Infrastructure optimization: Reducing workload on our database infrastructure through architectural improvements.
- Enhanced monitoring: Deploying additional observability tools to detect replication issues earlier.
- Automated alerting: New alerts configured for replication lag and transaction log queue sizes to enable faster detection and response; a sketch of such a check follows below.
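As a concrete illustration of the automated alerting item above, the following is a minimal sketch of such a check, again assuming a PostgreSQL-style primary with streaming replication; the thresholds, connection string, and helper name `check_wal_backlog` are hypothetical placeholders rather than our actual tooling.

```python
# Sketch of an alert check: flag replicas whose unsent transaction-log (WAL)
# backlog or replication lag exceeds a threshold. Assumes PostgreSQL-style
# replication; thresholds, DSN, and names are hypothetical.
import psycopg2

PRIMARY_DSN = "host=primary.example.internal dbname=portal user=monitor"  # hypothetical
MAX_WAL_BACKLOG_BYTES = 512 * 1024 * 1024  # example threshold: 512 MiB
MAX_LAG_SECONDS = 60                       # example threshold: 1 minute

def check_wal_backlog(dsn: str) -> list[str]:
    """Return alert messages for standbys whose WAL backlog or lag is too high."""
    alerts = []
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        # pg_stat_replication is maintained on the primary and has one row per
        # connected standby; sent_lsn marks how much WAL has been shipped so far.
        cur.execute(
            """
            SELECT application_name,
                   pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS backlog_bytes,
                   EXTRACT(EPOCH FROM replay_lag) AS lag_seconds
            FROM pg_stat_replication
            """
        )
        for name, backlog, lag in cur.fetchall():
            if backlog is not None and backlog > MAX_WAL_BACKLOG_BYTES:
                alerts.append(f"{name}: WAL backlog {backlog / 1024**2:.0f} MiB")
            if lag is not None and lag > MAX_LAG_SECONDS:
                alerts.append(f"{name}: replication lag {lag:.0f}s")
    return alerts

if __name__ == "__main__":
    for message in check_wal_backlog(PRIMARY_DSN):
        print("ALERT:", message)  # in production this would page the on-call engineer
```

In practice a check like this would run from the monitoring system on a short interval and feed the paging pipeline, so that slowly growing lag is surfaced well before it affects customer-facing services.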
We are committed to continuous improvement and will provide updates as our investigation progresses.