RTP-LLM Performance Benchmark Tool#

In this chapter, I will present the performance testing tools developed in RTP-LLM, including standalone measurement of model prefill and decode performance under various batch sizes with single-node and multi-node parallelism, timeline recording, and their usage methods.

Design Principle#

RTP-LLM employs a special batch scheduler that accumulates requests until the specified batch size is reached, then all requests enter the engine simultaneously. The scheduler supports both prefill and decode modes; in decode mode, requests are only allocated KV cache without prefill, enabling accurate and efficient measurement of engine performance. In detail, the batch-scheduler profiler is executed three times for a single input:

  • Warm-up run, to account for one-time setup such as JIT compilation.

  • Timing run, to measure engine performance.

  • Profiling run, to capture a timeline for subsequent analysis; this step may degrade end-to-end performance.

For every run we use min_new_tokens and max_new_tokens to ensure that all requests perform the same number of decode steps.

Since in decode mode, we not prefill the KVCache, so hidden_states after every forward step is not real. So we also hack moe gate select for moe model and speculative accept func for mtp. In that way, we get stable result for analyse result.

Single-Node Benchmark#

using commands below can start a performance benchmark, mixed prefill and decode

bazelisk test //rtp_llm/test/perf_test:perf_test \
    --config=cuda12_6 \
    --test_arg=--ckpt_path=${/PATH/TO/CKPT} \
    --test_arg=--tokenizer_path=${/PATH/TO/TOKENIZER} \
    --test_arg=--model_type=${MODEL_TYPE} \
    --test_arg=--dp_size=1 \
    --test_arg=--tp_size=1 \
    --test_arg=--batch_size="1,2,4,8,16,32" \
    --test_arg=--input_len="128,1024,2048,4096" \
    --test_env=INT8_MODE=1 # optionally using --test_env to using custom env for RTP-LLM setup

specially batch_size states for the batch in single DP node, when DP_SIZE param is setted.

also we support test prefill or decode only when prefill and decode not share the same config(such as prefill use deepep normal, and decode use deepep masked), in that case user should also set --partial={0:all(default), 1:decode, 2:prefill}, below is an example of testing decode only:

    --test_arg=partial=1

Multi Node Benchmark#

We also provide a Python script to enable multi-node benchmark, but since it requires setting up the environment and starting the script on multiple machines, it still involves more steps than single-node testing.

For each machine to be tested, you need to create an environment in which RTP-LLM can run and support passwordless SSH access from the current machine’s port.

Benchmark Yaml#

Configure the parameters with reference to rtp_llm/test/perf_test/multi_node/multi_benchmark_config.yaml. Below is the detail explaination of yaml structure.

First part is for cloning code to local in ssh machine, and checkout to the branch you need to test

benchmarks:
  - name: "H20_Deepseek-R1_Decode_EP32_4K"
    # git config
    git_repo_url: "git@github.com:alibaba/rtp-llm.git"
    git_checkout_ref: "origin/main"

Second part describe the machine list, user name and port to ssh

    # machine config
    ip_lists:
      - "33.126.67.231"
      - "33.126.67.17"
      - "33.126.51.159"
      - "33.126.83.168"
    run_user: "admin"
    ssh_port: 2222
    # model config

Third part describe the model info which should be pre-download to local machine

    tokenizer_path: "/mnt/nas1/hf/deepseek_r1_4layers/"
    checkpoint_path: "/mnt/nas1/hf/deepseek_r1_4layers/"
    model_type: "deepseek3"

Fourth part describe the test cases, including prefill/decode, batch_size and input_len(they will be used as Cartesian product).Specially, tp_size and dp_size len should be equal as each tuple of them will be started as a parallel config. For example, in below config, script will start three server with TP=1 DP=32, TP=2 DP=16, TP=4 DP=8 and benchmark

    # test config
    is_decode: true
    batch_size_list: "[1,2,4,8,16,32,48,64,80]"
    input_len_list: "[4096]"
    tp_size: [1,2,4]
    dp_size: [32,16,8]

bazel_build_args is the flag for bazelisk build, if you want to test in AMD card, change --config=rocm

    # build config
    bazel_build_args: '" --jobs 100 --verbose_failures --config=cuda12_6 "'
    # file dir config
    ft_sub_dir: "rtp_llm_perf_test"

last part is the model env config, be careful that all env configs type should in [int, float, string, bool], or there maybe unexpected error

    # model config
    start_port: 12333
    concurrency_limit: 80
    accl_dispatch_num_warp_groups: 4
    accl_combine_num_warp_groups: 4
    decode_test_length: 2048
    warm_up: 1
    act_type: "bf16"
    weight_type: "fp16"
    reserver_runtime_mem_mb: 0
    device_reserve_memory_bytes: 0
    load_ckpt_num_process: 96
    max_context_batch_size: 1
    enable_merge_w13: true
    use_deepep_moe: true
    enable_layer_micro_batch: 2
    enable_comm_overlap: true
    redundant_expert: 0
    accl_low_latency_optimize: 1

Run step#

# in root of RTP-LLM
cd rtp_llm/test/perf_test/multi_node
# Configure multi_benchmark_config.yaml
# Firstly run use -s to download wheel deps
/opt/conda310/bin/python3 multi_benchmark.py -m run -s
# Second and after
/opt/conda310/bin/python3 multi_benchmark.py -m run
# Finally clean running dir in machine
/opt/conda310/bin/python3 multi_benchmark.py -m clean

Also, multi-node benchmark dumps profile json in rank0, but currently we don’t scp dir to local yet. So please go to rank0 and get profile data manually before clean step

Result Format#

Decode result, where batch size stands for per DP Rank

+---------------------------------------------------------------------------------------------+
|                                        Decode Result                                        |
+---------+------------+------------------+--------------+------------------+-----------------+
| Seq Len | Batch Size | Sucess/Total Req | Input/Output | Waiting Time(ms) | Decode Time(ms) |
+---------+------------+------------------+--------------+------------------+-----------------+
| 4096    | 1          | 1/1              | 4096/2048    | 0.00             | 1.75            |
| 4096    | 2          | 2/2              | 4096/2048    | 0.00             | 1.72            |
| 4096    | 4          | 4/4              | 4096/2048    | 0.00             | 1.91            |
| 4096    | 8          | 8/8              | 4096/2048    | 0.00             | 1.93            |
| 4096    | 16         | 16/16            | 4096/2048    | 0.00             | 2.00            |
| 4096    | 32         | 32/32            | 4096/2048    | 0.00             | 2.19            |
| 4096    | 48         | 48/48            | 4096/2048    | 0.00             | 2.43            |
| 4096    | 64         | 64/64            | 4096/2048    | 0.00             | 2.60            |
| 4096    | 80         | 80/80            | 4096/2048    | 0.00             | 2.87            |
+---------+------------+------------------+--------------+------------------+-----------------+