Continuing delayed and missed patch windows

Incident Report for Automox

Postmortem

Please see below for additional information regarding our outage. If you’d like to share any comments or concerns, we encourage you to visit our community page at the link below:

https://community.automox.com/t/root-cause-analysis-for-the-august-10th-11th-outage/1182

Issue Summary

To improve how Agents communicate with our product, increase the accuracy of Agent connection reporting, and allow for larger scale in the future, we released changes to our architecture. Following the release, a severe backup in our Agent Message Queues, specifically two subqueues used for scheduling and connections, caused significant delays to Agent execution and status reporting. This manifested in the following behaviors:

- Delayed execution of all policies and scans
- Systems showing as constantly patching, scanning, and/or initializing
- Inaccurate reporting of connection states
- Freshly installed agents taking a while to show up in the Dashboard for Organizations

Root Cause

On 08/07/2020, after weeks of testing in our staging environment, we released the above changes to our architecture. This change increased the number of messages queued in the Agent Connection Queue. This had very little impact initially, as we were able to process the increased load of messages in a timely manner. However, by 08/10/2020, the Agent Connection Queue had reached 8.5 million queued messages. As fixes were implemented to unblock the initial 8.5 million messages, new message backups formed at different logic bottlenecks in our Agent Communication process.
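The backlog described above can be reasoned about with simple queue arithmetic. The following is a minimal sketch, not a description of Automox's systems: the function names are hypothetical, and the three-day window is an assumption, since the report gives only the release date and the date the backlog was observed, not exact times.

```python
from typing import Optional


def net_growth_per_second(backlog: int, elapsed_seconds: float) -> float:
    """Average rate at which arrivals outpaced processing over a window."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return backlog / elapsed_seconds


def seconds_to_drain(backlog: int, arrival_rate: float, service_rate: float) -> Optional[float]:
    """Time to empty a backlog, or None if processing cannot keep up.

    arrival_rate and service_rate are in messages per second.
    """
    surplus = service_rate - arrival_rate
    if surplus <= 0:
        # The queue is still growing (or flat), so it never drains.
        return None
    return backlog / surplus


# ASSUMPTION: three full days between the 08/07 release and the 08/10 backlog.
ASSUMED_WINDOW_SECONDS = 3 * 24 * 60 * 60
growth = net_growth_per_second(8_500_000, ASSUMED_WINDOW_SECONDS)
```

Under that assumed window, the average net growth works out to about 33 messages per second, which illustrates how a modest sustained imbalance between arrivals and processing can go unnoticed at first and still produce a backlog of millions within days.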

Resolution

The incident was resolved through a handful of fixes released at various points during the outage. Each fix enabled the processing of queued messages through the relevant portion of the Agent Communication process. Additional fixes that increased efficiency were implemented until the respective queues were restored to 0.

After the queues were restored, a cleanup effort was required to restore accurate status reporting to affected agents. This remediation effort took many hours to process every endpoint, but it ultimately restored all endpoints to an accurate state.

Next Steps

We have adjusted alarm thresholds to notify the appropriate resources before Agent Message Queues grow past a manageable size. We will also conduct further analysis into ways to improve efficiency in our Agent Communication logic. This will be an ongoing process and will be augmented as we continue to scale.
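As an illustration of the kind of alarm described above, here is a minimal sketch of a queue-depth check that fires both on absolute depth and on projected growth. It is not Automox's implementation: the `QueueSample` type, the `evaluate_alarm` function, and every threshold value used with it are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QueueSample:
    timestamp: float  # seconds since an arbitrary epoch
    depth: int        # number of messages waiting in the queue


def evaluate_alarm(
    prev: QueueSample,
    curr: QueueSample,
    warn_depth: int,
    critical_depth: int,
    horizon_seconds: float,
) -> str:
    """Return 'critical', 'warning', or 'ok' for the current sample."""
    # Absolute thresholds on the current depth come first.
    if curr.depth >= critical_depth:
        return "critical"
    if curr.depth >= warn_depth:
        return "warning"

    # Otherwise, warn early if the current growth rate would carry the
    # queue past the critical depth within the look-ahead horizon.
    dt = curr.timestamp - prev.timestamp
    if dt > 0:
        growth = (curr.depth - prev.depth) / dt  # messages per second
        if growth > 0 and curr.depth + growth * horizon_seconds >= critical_depth:
            return "warning"
    return "ok"
```

Checking the growth rate as well as the depth is a design choice aimed at the failure mode in this incident: a queue that is still small in absolute terms but growing steadily can be flagged before it reaches a size that takes hours to drain.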

Posted Aug 20, 2020 - 23:46 UTC

Resolved

At this time we have completed our cleanup effort and have confirmed that normal operations have resumed. We apologize for the inconvenience caused and will be following up during normal business hours to address any inquiries or concerns.

Posted Aug 12, 2020 - 05:06 UTC

Update

We are getting towards the end of our cleanup effort and will continue to monitor as things progress. We will provide another update by 11:30 PM MDT.

Posted Aug 12, 2020 - 03:24 UTC

Update

Cleanup is now about two-thirds complete. We will continue to monitor and provide an additional update by 9:00 PM MDT.

Posted Aug 12, 2020 - 01:39 UTC

Monitoring

Cleanup is well underway; however, we expect this process to take a few hours. Some devices may display inaccurate statuses until this is complete. We will continue to monitor the cleanup process and provide an additional update by 8:00 PM MDT.

Posted Aug 11, 2020 - 22:17 UTC

Update

We are continuing to work on cleanup and will provide updates as things progress.

Posted Aug 11, 2020 - 21:18 UTC

Update

We have implemented a fix for the root cause and are now working on restoring accurate Agent Status Reporting. Agent responsiveness should now be back to normal; however, customers may still experience the following issues:

- Endpoints seemingly stuck in "initializing" or "scanning" states
- Endpoints showing as disconnected when they are online

Endpoints that are not currently in the above listed states should operate as expected at this time.

Posted Aug 11, 2020 - 20:08 UTC

Update

We are continuing to work on a fix for this issue and will post updates here as things progress.

Posted Aug 11, 2020 - 18:35 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted Aug 11, 2020 - 15:39 UTC

Investigating

We are currently experiencing problems which are causing the following issues:

- delayed processing of commands sent to agents
- delayed onboarding of new systems
- possible execution of policies outside of scheduled times
- systems seemingly stuck in "initializing" or "scanning" states

We believe these issues have persisted overnight and apologize for the misrepresentation on our status page. We are working diligently to resolve these issues as soon as possible and will continue to provide updates as things progress.

Posted Aug 11, 2020 - 15:08 UTC

This incident affected: Console (web) and Patching Engine.