We want to provide a transparent summary of an incident that affected our system from January 6th to January 14th. During this period, we experienced difficulties in retrieving package events and sending the associated notifications.
The root cause was an ongoing upgrade to our database, which inadvertently placed an excessive read/write load on the disk. The resulting slowdown significantly impacted critical operational processes, including event retrieval and notification scheduling.
We deliberately scheduled the migration for a low-activity period to minimize disruption to our customers. The production migration started on December 29th, and at first we saw no issues with the system. It was not until January 6th that the processing of large tables caused all database operations to slow down dramatically.
Over the course of several days, our team worked to optimize operations and restore normal system performance. We initially believed that notifications were still being sent despite the delays in tracker event processing. On Thursday, January 9th, we discovered that no notifications had been sent since the morning of Tuesday, January 7th. We quickly addressed this issue and restored notification functionality that same evening.
The migration was completed on January 14th, and event retrieval and notification sending are now back to normal.
Lessons learned:
Our goal is to provide our customers with a reliable and trustworthy experience. We take incidents like this seriously and are continually working to improve our processes and communication channels.