Technical Implementation of AREX Agent

AREX is an automated regression testing platform that uses Java Agent and bytecode enhancement technology to record and replay traffic. This post shares the implementation details of the AREX Agent.

Background

Within Ctrip, as the company's business grows in scale and complexity, the R&D and testing teams face a variety of efficiency challenges. Quality assurance becomes a top priority, particularly in scenarios where large amounts of test data must be constructed, data written to the database must be validated, and releases are frequent.

To guarantee quality under continuous delivery (CD), we developed AREX, an automated testing platform for "regression testing with real production traffic and data", built on Java Agent and bytecode enhancement technologies. AREX records the requests and responses of entry points and their dependencies in the production environment, then simulates the requests in the test environment and verifies the logic of the entire call chain step by step.

By combining recording, replay, and comparison, AREX makes regression testing more efficient and accurate. After Ctrip's Airfare BU adopted AREX, the efficiency of regression testing for a single release increased by 75%. At present, 85% of the core applications are connected to AREX and use it for regression testing. Other BUs, such as Hotels, Business Travel, Car, Train Tickets, and the Platform R&D Center, are gradually onboarding as well, for a cumulative total of 2,000+ apps and 18,000+ interfaces.

In short, AREX replays real production traffic in the test environment. Its features are as follows:

Non-intrusive data collection and automatic mocking, with recording and replay support for commonly used open source components: Dubbo, HTTP, Redis, persistence layer frameworks, and configuration centers.
Verification of complex business scenarios, including multi-threaded concurrency, asynchronous callbacks, and write operations.
Data recorded in production can be replayed locally to reproduce production bugs quickly.

This article shares the implementation details of the AREX Agent, in the hope of offering developers some inspiration and help. The project is open source at https://github.com/arextest/arex-agent-java

Process of starting up

To keep the onboarding cost of an application low, integrating the AREX Agent must be non-intrusive and transparent to the user.

The process of integrating and starting the AREX Agent is shown in the following figure.

After the user selects the AREX Agent service in the CI pipeline, the AREX startup script arex-agent.sh is packaged into the distribution package when the image is rebuilt.
The startup script pulls the latest arex-agent.jar and attaches the AREX Agent by modifying the application's JVM options.
After initialization, the JVM calls the premain method to start the AREX Agent. The agent pulls the application's configuration from the configuration service, loads AREX Agent plug-ins on demand according to that configuration, and enhances bytecode as the application's classes are loaded at startup.
Currently, only one machine in a single cluster is allowed to start the AREX Agent.
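
As a rough illustration of the premain step above, the sketch below shows the standard shape of an agent entry point that registers a class file transformer. SketchAgent, SketchTransformer, and the target class name are hypothetical names chosen for this example. The configuration pull and plug-in loading that the AREX Agent performs are omitted.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical agent entry point; names are illustrative, not AREX's own.
final class SketchAgent {
    // Internal (slash-separated) names of classes that enabled plug-ins want to enhance.
    static final Set<String> TARGETS = ConcurrentHashMap.newKeySet();

    // Called by the JVM before the application's main method.
    public static void premain(String agentArgs, Instrumentation inst) {
        // Assumption: a real agent would fill TARGETS from its configuration service.
        TARGETS.add("com/example/OrderService");
        inst.addTransformer(new SketchTransformer());
    }

    static final class SketchTransformer implements ClassFileTransformer {
        @Override
        public byte[] transform(ClassLoader loader, String className,
                                Class<?> classBeingRedefined,
                                ProtectionDomain protectionDomain,
                                byte[] classfileBuffer) {
            if (className == null || !TARGETS.contains(className)) {
                return null; // null tells the JVM to keep the original bytecode
            }
            // A real agent would rewrite the bytes with a bytecode library here;
            // this sketch returns them unchanged.
            return classfileBuffer;
        }
    }
}
```

An agent of this shape is attached with a JVM option of the form -javaagent:/path/to/agent.jar, and the jar manifest must declare the entry class in its Premain-Class attribute.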

Process of Record and Replay

As shown in the figure below, a request typically has a chain of calls consisting of an entry point and dependencies that are either synchronous or asynchronous. Recording links the entry and dependency calls through a RecordId to form a complete test case. The AREX Agent enhances the bytecode of the entry and dependency calls, intercepts each call as the code executes, records its input parameters, return value, and exceptions, and sends them to the storage service.

When replaying in the test environment, the real data recorded in production is used to simulate the request, and the AREX Agent decides for each call whether it should be replayed. If so, the AREX Agent does not make the real call; instead it pulls the recorded response from the storage service and returns it.
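
The record-or-replay decision described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the agent's actual code: RecordReplaySketch, intercept, and the in-memory STORAGE map (a stand-in for the storage service) are hypothetical, and exception recording is left out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical stand-in for the logic woven around a dependency call.
final class RecordReplaySketch {
    // Stand-in for the storage service: "recordId|operation" -> recorded response.
    static final Map<String, Object> STORAGE = new ConcurrentHashMap<>();

    static Object intercept(String recordId, boolean replaying,
                            String operation, Supplier<Object> realCall) {
        String key = recordId + "|" + operation;
        if (replaying) {
            // Replay: skip the real call and return what was recorded.
            return STORAGE.get(key);
        }
        // Record: make the real call, then save its response under the RecordId.
        Object response = realCall.get();
        if (response != null) { // ConcurrentHashMap does not accept null values
            STORAGE.put(key, response);
        }
        return response;
    }
}
```

Keying every saved response by RecordId is what lets the entry call and all of its dependency calls be reassembled into one test case.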

The following figure shows an example of the bytecode of a synchronous SOA client call after enhancement by the AREX Agent. Other components are enhanced in a similar way.

Technical challenges

Recording and replay are complex, and the varied implementations found across applications posed a number of challenges. The following sections share how the AREX Agent addresses them.

ClassLoader Isolation and Interoperability

To ensure that the AREX Agent's code and dependencies do not conflict with the application code, the two are isolated by different class loaders. As shown in the figure below, the AREX Agent defines a custom AgentClassLoader that overrides the findClass method, so that classes used by the agent are loaded only by AgentClassLoader and never clash with the application ClassLoader.

Meanwhile, so that the application ClassLoader can see the AREX Agent's recording and replay code, the agent injects the bytecode needed for recording and replay into the application ClassLoader through the ByteBuddy ClassInjector. This prevents ClassNotFoundException and NoClassDefFoundError during recording and replay.
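
To make the isolation idea concrete, here is a minimal sketch of a class loader that loads classes under an agent package prefix only from the agent's own jars. It overrides loadClass rather than findClass, a different but commonly used way of getting the same effect, so it should not be read as the AgentClassLoader implementation. SketchAgentClassLoader and the package prefix are hypothetical.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical isolating loader; not the AREX AgentClassLoader.
final class SketchAgentClassLoader extends URLClassLoader {
    private final String agentPackagePrefix;

    SketchAgentClassLoader(URL[] agentJars, ClassLoader parent, String agentPackagePrefix) {
        super(agentJars, parent);
        this.agentPackagePrefix = agentPackagePrefix;
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        if (!name.startsWith(agentPackagePrefix)) {
            // JDK and other non-agent classes: normal parent delegation.
            return super.loadClass(name, resolve);
        }
        synchronized (getClassLoadingLock(name)) {
            Class<?> loaded = findLoadedClass(name);
            if (loaded == null) {
                // Agent classes: look only in the agent jars, never in the parent,
                // so a same-named class in the application cannot be picked up.
                loaded = findClass(name);
            }
            if (resolve) {
                resolveClass(loaded);
            }
            return loaded;
        }
    }
}
```

With this arrangement the agent can bundle its own versions of shared libraries without affecting, or being affected by, the versions the application ships.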

Tracing

During recording and replay, the entry point of a request and the calls to each dependency are linked together by a RecordId. With multiple threads and a variety of asynchronous frameworks in play, keeping that link intact is a major challenge. The AREX Agent solves the problem of passing the RecordId across threads by enhancing thread and thread pool classes. Currently supported threads and thread pools are as follows.

Thread
ThreadPoolExecutor
ForkJoinTask
FutureTask
FutureCallback
Reactor Framework
……

Here is a simple code example to illustrate the implementation. The other cases are handled along similar lines.

When java.util.concurrent.ThreadPoolExecutor#execute(Runnable runnable) is called, the agent wraps the runnable argument in a wrapper (AgentRunnableWrapper, shown as RunnableWrapper in the example). The wrapper captures the current thread's context when it is constructed, replaces the child thread's context with the captured one in its run method, and restores the child thread's context after execution. The code example is as follows:

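// Typical call sites in application code
executors.execute(Runnable runnable)
executors.submit(Callable callable)

// ThreadPoolExecutor#execute after enhancement: the task is wrapped on entry
public void execute(Runnable var1) {
  var1 = RunnableWrapper.wrap(var1);
  // ... the original method body then runs with the wrapped task
}

public class RunnableWrapper implements Runnable {
  private final Runnable runnable;
  private final TraceTransmitter traceTransmitter;

  private RunnableWrapper(Runnable runnable) {
    this.runnable = runnable;
    // Capture the context of the current (submitting) thread
    this.traceTransmitter = TraceTransmitter.create();
  }

  // Factory used by the enhanced execute method
  public static Runnable wrap(Runnable runnable) {
    return new RunnableWrapper(runnable);
  }

  @Override
  public void run() {
    // Replace the child thread's context with the captured one
    try (TraceTransmitter tm = traceTransmitter.transmit()) {
      runnable.run();
    }
    // Closing the transmitter restores the child thread's original context
  }
}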

...
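
The example above relies on TraceTransmitter, whose internals are not shown. As a self-contained illustration of the same capture, replace, and restore pattern, the sketch below propagates a RecordId held in a plain ThreadLocal. TraceContextSketch and RECORD_ID are hypothetical names, and a ThreadLocal holder is an assumption about how such a context might be stored.

```java
// Hypothetical, self-contained version of the capture/replace/restore pattern.
final class TraceContextSketch {
    // Assumption: the RecordId of the request handled by the current thread.
    static final ThreadLocal<String> RECORD_ID = new ThreadLocal<>();

    static Runnable wrap(Runnable task) {
        // Captured on the submitting thread, at wrap time.
        final String captured = RECORD_ID.get();
        return () -> {
            // Remember whatever the worker thread had before.
            String previous = RECORD_ID.get();
            RECORD_ID.set(captured);
            try {
                task.run();
            } finally {
                // Restore the worker's own context so pooled threads do not leak RecordIds.
                if (previous == null) {
                    RECORD_ID.remove();
                } else {
                    RECORD_ID.set(previous);
                }
            }
        };
    }
}
```

Restoring in a finally block matters for thread pools: worker threads are reused, so a RecordId left behind would be attributed to the next, unrelated task.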

Component Version Compatibility

An application may include multiple versions of a component, and versions of the same component may be incompatible with one another, for example through package changes or added and removed methods. To support multiple versions, the AREX Agent needs to recognize the correct component version before enhancing bytecode, so that it avoids duplicate enhancement or enhancing the wrong version.

The AREX Agent ensures that the correct version is enhanced by reading the name and version from META-INF/MANIFEST.MF inside the component jar and performing a version match when the class is loaded.
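
A version check of this kind might look roughly like the sketch below, which reads two attributes from a jar manifest and tests the version against a range. Which manifest attributes the AREX Agent actually reads is not stated here, so the use of Implementation-Title and Implementation-Version is an assumption, as are the VersionMatchSketch class and its numeric dotted-version comparison.

```java
import java.util.jar.Attributes;
import java.util.jar.Manifest;

// Hypothetical matcher; attribute names and comparison rules are assumptions.
final class VersionMatchSketch {
    // True when the manifest names the expected component and its version
    // lies in the half-open range [minInclusive, maxExclusive).
    static boolean matches(Manifest manifest, String expectedName,
                           String minInclusive, String maxExclusive) {
        Attributes main = manifest.getMainAttributes();
        String name = main.getValue("Implementation-Title");
        String version = main.getValue("Implementation-Version");
        if (name == null || version == null || !name.equals(expectedName)) {
            return false;
        }
        return compare(version, minInclusive) >= 0 && compare(version, maxExclusive) < 0;
    }

    // Compares dotted versions segment by segment; missing segments count as 0.
    static int compare(String a, String b) {
        String[] x = a.split("\\.");
        String[] y = b.split("\\.");
        for (int i = 0; i < Math.max(x.length, y.length); i++) {
            int xi = i < x.length ? leadingInt(x[i]) : 0;
            int yi = i < y.length ? leadingInt(y[i]) : 0;
            if (xi != yi) {
                return Integer.compare(xi, yi);
            }
        }
        return 0;
    }

    // Parses the leading digits of a segment, so "1-SNAPSHOT" is read as 1.
    static int leadingInt(String part) {
        int end = 0;
        while (end < part.length() && Character.isDigit(part.charAt(end))) {
            end++;
        }
        return end == 0 ? 0 : Integer.parseInt(part.substring(0, end));
    }
}
```

Each enhancement can then declare the component name and version range it was written for, and be skipped when the jar that supplied the class falls outside that range.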

MOCK of Local Cache

Object value = localCache.get(key);
// The cache is populated during recording but may be empty during replay,
// in which case the code falls through to the database query (db.query()).
if (value != null) {
    return value;
} else {
    return db.query();
}

As shown above, the code first tries to retrieve the value for the given key from the local cache (localCache.get(key)). During recording in production the cache is populated, so the value is non-null and is returned directly.

During replay, however, the local cache in the test environment may not contain that entry. The lookup then returns null, and the code falls through to the database query (db.query()), a call that never happened during recording.

In short, the execution flow of a replayed request often differs from the recorded one because the local cache data is inconsistent with the recording, resulting in a low pass rate for replay testing. Solving this problem poses several challenges:

Production and test environments are isolated from each other, which makes real-time synchronization of cache data between them difficult.
Local caches are implemented in many different ways, and the agent cannot be aware of each one individually.
Locally cached data is typically basic reference data and can be large in volume. Recording all of it can add significant performance overhead.

The solution the AREX Agent currently adopts is to record only the cached data that the current request actually uses. The application configures the relevant classes as dynamic classes so that the agent can identify and record them, and the recorded data is substituted automatically during replay in the test environment, keeping in-memory data consistent between recording and replay. We are still researching how to record large caches. Anyone with experience in this area is welcome to discuss it with the community.
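
The effect of recording only the cache reads a request actually performs can be illustrated with the small sketch below, built around the lookup shown earlier. CacheMockSketch, its fields, and the "cache:"/"db" return values are all hypothetical. How the AREX Agent identifies dynamic classes and stores their results is not shown here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of mocking local cache reads per request.
final class CacheMockSketch {
    // Cache reads observed for one recorded request, keyed by cache key.
    final Map<String, Object> recordedReads = new HashMap<>();
    // The application's local cache in the current environment.
    final Map<String, Object> localCache = new HashMap<>();
    boolean replaying;

    // Stand-in for an enhanced cache accessor.
    Object cacheGet(String key) {
        if (replaying) {
            // Replay: answer from the recording and ignore the local cache state.
            return recordedReads.get(key);
        }
        Object value = localCache.get(key);
        recordedReads.put(key, value); // HashMap allows null, so misses are recorded too
        return value;
    }

    // Same branching as the earlier snippet; "db" marks the database path.
    String lookup(String key) {
        Object value = cacheGet(key);
        return value != null ? "cache:" + value : "db";
    }
}
```

Because only the keys touched by the request are stored, the recording stays small even when the cache itself is large, and replay follows the same branch as production did.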

Time Mock

Many business systems are time-sensitive: accessing them at different times can produce different results. If the recording and replay times differ, replay can fail. Changing the machine time on the test server is not an option, because replay requests run concurrently and many servers do not allow the system time to be modified. We therefore need to mock the current time at the code level.

The currently supported time types are as follows:

java.time.Instant
java.time.LocalDate
java.time.LocalTime
java.time.LocalDateTime
java.util.Date
java.util.Calendar
org.joda.time.DateTimeUtils
java.time.ZonedDateTime

System.currentTimeMillis(), declared as public static native long currentTimeMillis(), is a JVM intrinsic. When the JIT compiler inlines an intrinsic, it replaces the call with internal code, so an enhancement that the AREX Agent applies to the method itself no longer takes effect. HotSpot inlines System.currentTimeMillis() and System.nanoTime() as follows:

// https://hg.openjdk.org/jdk8u/jdk8u/hotspot/file/dae2d83e0ec2/src/share/vm/classfile/vmSymbols.hpp#l631
//------------------------inline_native_time_funcs--------------
// inline code for System.currentTimeMillis() and System.nanoTime()
// these have the same type and signature
bool LibraryCallKit::inline_native_time_funcs(address funcAddr, const char* funcName) {
  const TypeFunc* tf = OptoRuntime::void_long_Type();
  const TypePtr* no_memory_effects = NULL;
  Node* time = make_runtime_call(RC_LEAF, tf, funcAddr, funcName, no_memory_effects);
  Node* value = _gvn.transform(new ProjNode(time, TypeFunc::Parms+0));
#ifdef ASSERT
  Node* value_top = _gvn.transform(new ProjNode(time, TypeFunc::Parms+1));
  assert(value_top == top(), "second value must be top");
#endif
  set_result(value);
  return true;
}

The AREX Agent handles this case specially. Instead of enhancing System.currentTimeMillis() itself, it replaces the application code that calls System.currentTimeMillis(), as specified in the application configuration, with a call to the agent's own method for obtaining the time, which sidesteps the inline optimization.
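
The replacement method that call sites are redirected to could look something like the following sketch. MockClockSketch is hypothetical, and keeping the recorded time in a ThreadLocal is an assumption made here so that concurrent replay requests do not interfere with one another. Returning a fixed recorded instant is likewise a simplification.

```java
// Hypothetical replacement target for System.currentTimeMillis() call sites.
final class MockClockSketch {
    // Recorded time for the replay running on this thread; null means "not replaying".
    private static final ThreadLocal<Long> RECORDED_MILLIS = new ThreadLocal<>();

    static void startReplay(long recordedMillis) {
        RECORDED_MILLIS.set(recordedMillis);
    }

    static void endReplay() {
        RECORDED_MILLIS.remove();
    }

    // Rewritten call sites invoke this instead of System.currentTimeMillis().
    static long currentTimeMillis() {
        Long recorded = RECORDED_MILLIS.get();
        // Outside replay, fall back to the real clock.
        return recorded != null ? recorded : System.currentTimeMillis();
    }
}
```

Since this is an ordinary Java method rather than an intrinsic, the JIT compiler treats it like any other application code and the mocked value is preserved.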

Planning

We will focus on the following aspects of optimisation.

Improve replay efficiency and reduce the termination rate.
Improve experience and reduce user cost.
Improve precision testing to reduce the number of replay cases.
Comprehensive static/dynamic code analysis to achieve 100% online business scenario coverage.

AREX is open source at https://github.com/arextest. The community currently has nearly two hundred users, including dozens of enterprise users across finance, Internet, manufacturing, e-commerce, and other industries. We hope more contributors will join the community and help strengthen the company's technology brand.
