Please see below for additional information regarding our outage. We also encourage you to share any comments or concerns on our community page at the link below:
To improve how Agents communicate with our product, increase the accuracy of Agent connection reporting, and allow for larger scale in the future, we released changes to our architecture. Following the release, a severe backup in our Agent Message Queues, specifically two subqueues used for scheduling and connections, caused significant delays to Agent execution and status reporting. This manifested in the following behaviors:
On 08/07/2020, after weeks of testing in our staging environment, we released the above changes to our architecture. This change increased the number of messages queued in the Agent Connection Queue. The impact was initially small because we were able to process the increased message load in a timely manner. However, by 08/10/2020, the Agent Connection Queue had reached 8.5 million queued messages. As fixes were implemented to unblock the initial 8.5 million messages, new message backups formed at different logic bottlenecks in our Agent Communication process.
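The way a backlog can move from one bottleneck to the next can be illustrated with a toy model. The following is a minimal sketch only, not the production system: the function `simulate`, the two-stage layout, and all rates are hypothetical values chosen for illustration. It assumes each stage processes at most a fixed number of messages per time step and passes them to the next stage.

```python
def simulate(arrival_rate, stage_rates, ticks):
    """Return the backlog waiting in front of each stage after `ticks` steps.

    arrival_rate: messages entering the first stage per step (hypothetical).
    stage_rates:  maximum messages each stage can process per step (hypothetical).
    """
    backlogs = [0] * len(stage_rates)
    for _ in range(ticks):
        incoming = arrival_rate
        for i, rate in enumerate(stage_rates):
            backlogs[i] += incoming              # new work arrives at this stage
            processed = min(backlogs[i], rate)   # stage is capped at its rate
            backlogs[i] -= processed
            incoming = processed                 # output feeds the next stage
    return backlogs


# Before a fix: the first stage is the bottleneck, so the backlog forms there.
before_fix = simulate(arrival_rate=100, stage_rates=[80, 90], ticks=10)
print(before_fix)  # [200, 0]

# After raising the first stage's rate: the backlog moves to the second stage.
after_fix = simulate(arrival_rate=100, stage_rates=[120, 90], ticks=10)
print(after_fix)  # [0, 100]
```

In this model, unblocking one stage does not remove the backlog while a later stage still processes fewer messages than it receives; the queue simply forms further downstream.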
The incident was resolved through a handful of fixes released at various points during the outage. Each fix enabled queued messages to be processed through its respective portion of the Agent Communication process. Additional fixes that increased efficiency were implemented until each queue was restored to 0.
After the queues were restored, a cleanup effort was required to restore accurate status reporting for affected agents. This remediation took many hours to process through every endpoint, but ultimately restored all endpoints to an accurate state.
We adjusted alarm thresholds to notify the appropriate resources before the Agent Message Queues grow past a manageable size. Additionally, we will be conducting further analysis into ways to improve efficiency in our Agent Communication logic. This will be an ongoing process and will be augmented as we continue to scale.
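A queue-depth alarm of the kind described above might look roughly like the following. This is a minimal sketch under stated assumptions, not the actual alerting configuration: the class `QueueAlarm`, the queue name, and both threshold values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QueueAlarm:
    """Two-level depth alarm for a single message queue (illustrative only)."""
    name: str
    warn_depth: int      # hypothetical: notify on-call while still manageable
    critical_depth: int  # hypothetical: escalate immediately

    def evaluate(self, depth: int) -> str:
        """Classify the current queue depth as 'ok', 'warn', or 'critical'."""
        if depth >= self.critical_depth:
            return "critical"
        if depth >= self.warn_depth:
            return "warn"
        return "ok"


# Hypothetical thresholds, set well below the 8.5 million messages seen in the incident.
alarm = QueueAlarm(name="agent-connection", warn_depth=100_000, critical_depth=1_000_000)

for depth in (50_000, 250_000, 8_500_000):
    print(alarm.name, depth, alarm.evaluate(depth))
```

Using a warning level below the critical level is one way to give responders time to act while the backlog can still be drained; the right values depend on the queue's normal processing rate.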