AREX Agent Protection Mechanism
To minimize the risk of degrading the production environment when AREX is used to record data, the Agent provides two protection mechanisms for recording scenarios: queue overflow health detection and storage service availability detection.
Steps for health detection of queue overflow
In high traffic situations or when recording frequency is high:
- All recording requests go into a queue first, then an asynchronous consumer takes the task from the queue and processes it (the default length of the queue is 1024).
- Before each recording request is put into the queue, the Agent checks whether the queue is full. If it is full, recording stops immediately.
- 30 seconds later, a health check task starts. It first reduces the recording frequency (by 20% of the current recording frequency) and then re-opens recording.
- 5 minutes later, it checks whether the queue has recovered. If not, it continues to reduce the frequency (by 20% of the current recording frequency each time) until the lowest frequency is reached (roughly once an hour).
- It checks again after 10 minutes and stops the health check once the queue is available again.
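The steps above can be sketched in Java roughly as follows. This is a minimal illustration, not the Agent's actual implementation: the class `QueueOverflowGuard` and all of its method names are hypothetical, the frequency is modeled as "recordings per hour" purely for convenience, and the 30-second, 5-minute, and 10-minute timers are left to the caller (for example a `ScheduledExecutorService`) so that the state transitions stay easy to follow.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: none of these names are part of the AREX Agent API.
class QueueOverflowGuard {
    static final int DEFAULT_CAPACITY = 1024;     // default queue length from the text
    static final double REDUCTION = 0.20;         // each step lowers the rate by 20%
    static final double MIN_RATE_PER_HOUR = 1.0;  // floor: roughly once an hour

    private final BlockingQueue<Runnable> queue;
    private double ratePerHour;                   // assumed unit, for illustration only
    private boolean recording = true;

    QueueOverflowGuard(int capacity, double initialRatePerHour) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.ratePerHour = initialRatePerHour;
    }

    /** Called before each recording; returns false when the task is dropped. */
    synchronized boolean tryRecord(Runnable task) {
        if (!recording) {
            return false;
        }
        // offer() returns false instead of blocking when the queue is full.
        if (!queue.offer(task)) {
            recording = false;                    // stop recording immediately
            return false;
        }
        return true;
    }

    /** Health check step: lower the rate by 20% (never below the floor), then re-open. */
    synchronized void reduceAndReopen() {
        ratePerHour = Math.max(MIN_RATE_PER_HOUR, ratePerHour * (1.0 - REDUCTION));
        recording = true;
    }

    /** The queue counts as recovered here when it has free capacity again. */
    boolean queueRecovered() {
        return queue.remainingCapacity() > 0;
    }

    /** Stand-in for the asynchronous consumer taking one task off the queue. */
    Runnable poll() {
        return queue.poll();
    }

    synchronized boolean isRecording() {
        return recording;
    }

    synchronized double ratePerHour() {
        return ratePerHour;
    }
}
```

In this sketch a scheduler would call `reduceAndReopen()` 30 seconds after `tryRecord` first returns false because of a full queue, then call `queueRecovered()` at the 5-minute and 10-minute marks, calling `reduceAndReopen()` again whenever the queue is still full.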
Steps for health checking the state of the storage service
If an exception is detected when calling the arex-storage service:
- If the arex-storage service call fails (including network anomalies) or the arex-storage service is unavailable, recording stops immediately.
- 10 seconds later, recording is re-opened and the service health indicators are collected (the service is considered healthy when 99% of the requests neither time out nor are rejected by the service).
- 3 minutes later, the Agent checks whether the service has recovered based on the service health metrics. If it has not recovered, the Agent continues to reduce the frequency (by 20% of the current recording frequency each time) until the lowest frequency is reached (roughly once an hour), and then checks again 10 minutes later. If the service has recovered, the detection stops.
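The health indicator described above can be sketched as a simple success-ratio counter. This is only an illustration under stated assumptions: the class `StorageHealthTracker`, its `Outcome` enum, and its methods are hypothetical names, and the 99% threshold is read here as "at least 99% of observed requests neither timed out nor were rejected", which is an interpretation of the text rather than a confirmed detail of the Agent.

```java
// Hypothetical sketch: none of these names are part of the AREX Agent API.
class StorageHealthTracker {
    enum Outcome { OK, TIMEOUT, REJECTED }

    static final int HEALTHY_PERCENT = 99;  // assumed reading of the 99% indicator

    private long total;                     // requests observed since recording re-opened
    private long ok;                        // requests that neither timed out nor were rejected
    private boolean recording = true;

    /** A failed call or an unavailable service stops recording immediately. */
    synchronized void onCallFailed() {
        recording = false;
    }

    /** Re-open recording (10 seconds later in the text) and start a fresh sample. */
    synchronized void reopen() {
        total = 0;
        ok = 0;
        recording = true;
    }

    /** Record the outcome of one request to the storage service. */
    synchronized void observe(Outcome outcome) {
        total++;
        if (outcome == Outcome.OK) {
            ok++;
        }
    }

    /** Healthy when at least 99% of observed requests succeeded; integer math avoids rounding. */
    synchronized boolean healthy() {
        return total > 0 && ok * 100 >= total * HEALTHY_PERCENT;
    }

    synchronized boolean isRecording() {
        return recording;
    }
}
```

A caller would invoke `onCallFailed()` when a storage call fails, `reopen()` after the 10-second pause, and `healthy()` at the 3-minute check to decide whether to stop the detection or lower the recording frequency by another 20%.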