Performance Profiling#
This document describes how to profile the performance of RTP-LLM.
Model Timeline#
The most straightforward way to analyze model performance is to inspect the execution timeline of the model. RTP-LLM can dump the model execution timeline directly into a Perfetto trace JSON file.
Per Request Timeline#
Setting the gen_timeline option to true in the generate_config of a request generates a timeline for that request. The complete request looks like:
{
  "prompt": "Hello, ",
  "generate_config": {
    "gen_timeline": true
  }
}
The generated timeline JSON file is written to the working directory from which rtp-llm was started.
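As a minimal sketch, the request above could be built and sent from Python as follows. Only the payload shape comes from this document; the helper names and the endpoint URL are hypothetical and must be adapted to your deployment.

```python
import json
import urllib.request


def build_timeline_request(prompt: str) -> dict:
    """Build a request body that asks for a per-request timeline."""
    return {
        "prompt": prompt,
        "generate_config": {
            "gen_timeline": True,  # serialized by json.dumps as JSON `true`
        },
    }


def send_request(url: str, body: dict) -> bytes:
    """POST the body as JSON. `url` is a hypothetical endpoint (assumption)."""
    data = json.dumps(body).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()


if __name__ == "__main__":
    body = build_timeline_request("Hello, ")
    print(json.dumps(body, indent=2))
    # send_request("http://<host>:<port>/", body)  # placeholder address, not a documented one
```

Building the body as a Python dict and letting `json.dumps` serialize it avoids hand-writing the boolean, which is `True` in Python but `true` in JSON.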
Service-Wide Timeline#
You may use the start argument --gen_timeline_sync or set the environment variable GEN_TIMELINE_SYNC=1 to enable service-level timeline profiling. When this option is enabled, every model request generates a profiling timeline.
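A minimal sketch of launching the server with the environment variable set, assuming the server is started with `python -m rtp_llm.start_server` as in the Nsight example later in this document. The helper names are hypothetical.

```python
import os
import subprocess
import sys


def timeline_env(base_env=None) -> dict:
    """Return a copy of the environment with service-wide timelines enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["GEN_TIMELINE_SYNC"] = "1"
    return env


def start_server_with_timeline() -> subprocess.Popen:
    """Start the server as a child process with GEN_TIMELINE_SYNC=1.

    Assumption: the current interpreter has rtp_llm installed.
    """
    cmd = [sys.executable, "-m", "rtp_llm.start_server"]
    return subprocess.Popen(cmd, env=timeline_env())
```

Copying the environment rather than mutating `os.environ` keeps the setting local to the child process.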
How to Visualize the Timeline#
Open the Perfetto UI and load the timeline JSON to visualize it. You may also use Perfetto SQL for quantitative analysis.
Example SQL Queries#
Here are some SQL queries that are useful for performance analysis.
Statistics of kernel latencies:
SELECT
ROUND(AVG(dur) / 1000, 2) AS avg_us,
MAX(dur) / 1000 AS max_us,
MIN(dur) / 1000 AS min_us,
ROUND(SUM(dur) / 1000.0 / 1000, 2) AS sum_ms,
ROUND(SUM(dur) * 100.0 / (SELECT SUM(dur) FROM slice WHERE (category = 'kernel' OR category = 'gpu_memcpy')), 1) AS percent,
COUNT(name) AS count,
name
FROM
slice
WHERE
(category = 'kernel' OR category = 'gpu_memcpy')
GROUP BY
name
ORDER BY
SUM(dur) DESC;
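The same per-kernel aggregation can be sketched offline in Python, without loading the trace into the Perfetto UI. This sketch assumes the dumped file follows the Chrome trace event JSON format (either a top-level `traceEvents` list or a bare list of events, with `cat`, `name` and `dur` fields, and `dur` in microseconds). The function names are hypothetical.

```python
import json
from collections import defaultdict

GPU_CATEGORIES = {"kernel", "gpu_memcpy"}


def load_events(path):
    """Load events from a trace file (object form or bare-list form)."""
    with open(path) as f:
        trace = json.load(f)
    return trace["traceEvents"] if isinstance(trace, dict) else trace


def kernel_stats(events):
    """Aggregate GPU event durations by name, largest total first."""
    durations = defaultdict(list)
    for ev in events:
        if ev.get("cat") in GPU_CATEGORIES and "dur" in ev:
            durations[ev["name"]].append(ev["dur"])
    total = sum(sum(d) for d in durations.values())
    rows = []
    # Sort on the unrounded total so rounding cannot reorder rows.
    for name, durs in sorted(
        durations.items(), key=lambda kv: sum(kv[1]), reverse=True
    ):
        s = sum(durs)
        rows.append({
            "name": name,
            "count": len(durs),
            "avg_us": round(s / len(durs), 2),
            "max_us": max(durs),
            "min_us": min(durs),
            "sum_ms": round(s / 1000, 2),
            "percent": round(s * 100 / total, 1) if total else 0.0,
        })
    return rows


if __name__ == "__main__":
    # Synthetic events for illustration only, not real measurements.
    demo = [
        {"name": "gemm", "cat": "kernel", "ts": 0, "dur": 30},
        {"name": "gemm", "cat": "kernel", "ts": 50, "dur": 50},
        {"name": "memcpy", "cat": "gpu_memcpy", "ts": 100, "dur": 20},
        {"name": "cpu_op", "cat": "cpu_op", "ts": 0, "dur": 500},
    ]
    for row in kernel_stats(demo):
        print(row)
```

Unlike the SQL query, no division by 1000 is needed for the microsecond columns, because under the stated assumption the JSON durations are already in microseconds.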
Analyze empty time slots between consecutive GPU events:
WITH cte AS (
SELECT name, ts, dur,
ROW_NUMBER() OVER (ORDER BY ts) AS seqnum
FROM slice
WHERE (category = 'kernel' OR category = 'gpu_memcpy')
)
SELECT slice.ts, slice.name, tprev.name,
slice.ts - COALESCE(tprev.ts, 0) - tprev.dur AS diff
FROM cte slice
LEFT OUTER JOIN cte tprev
ON slice.seqnum = tprev.seqnum + 1
ORDER BY diff DESC
Sum up all empty slots on the timeline:
SELECT SUM(diff)
FROM (
WITH cte AS (
SELECT
name,
ts,
dur,
ROW_NUMBER() OVER (ORDER BY ts) AS seqnum
FROM slice
WHERE (category = 'kernel' OR category = 'gpu_memcpy')
)
SELECT
slice.name,
slice.ts - COALESCE(tprev.ts, 0) - tprev.dur AS diff
FROM cte slice
LEFT OUTER JOIN cte tprev
ON slice.seqnum = tprev.seqnum + 1
)
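The gap computation used in the two queries above can be sketched in Python as follows: sort GPU events by start time, then subtract the end of each event from the start of the next one. The event field names (`cat`, `name`, `ts`, `dur`) assume the Chrome trace event JSON format, and the function names are hypothetical.

```python
GPU_CATEGORIES = {"kernel", "gpu_memcpy"}


def gpu_gaps(events):
    """Return (gap, prev_name, next_name) tuples, largest gap first.

    gap = next.ts - (prev.ts + prev.dur), mirroring the SQL `diff` column.
    """
    gpu = sorted(
        (e for e in events if e.get("cat") in GPU_CATEGORIES and "dur" in e),
        key=lambda e: e["ts"],
    )
    gaps = [
        (cur["ts"] - (prev["ts"] + prev["dur"]), prev["name"], cur["name"])
        for prev, cur in zip(gpu, gpu[1:])
    ]
    gaps.sort(key=lambda g: g[0], reverse=True)
    return gaps


def total_gap(events):
    """Sum of all gaps, the counterpart of SELECT SUM(diff)."""
    return sum(gap for gap, _, _ in gpu_gaps(events))


if __name__ == "__main__":
    # Synthetic events for illustration only, not real measurements.
    demo = [
        {"name": "a", "cat": "kernel", "ts": 0, "dur": 10},
        {"name": "b", "cat": "kernel", "ts": 15, "dur": 5},
        {"name": "c", "cat": "gpu_memcpy", "ts": 40, "dur": 10},
    ]
    for gap in gpu_gaps(demo):
        print(gap)
    print("total:", total_gap(demo))
```

As in the SQL version, events that overlap in time (for example kernels on different streams) yield negative gaps, so the total is only a direct idle-time measure when GPU events do not overlap.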
Use Nsight#
You may also use NVIDIA Nsight or other profiling tools provided by hardware manufacturers.
To enable Nsight profiling with rtp-llm, prepend the nsys binary and its options to the start command you actually execute:
/opt/nvidia/nsight-systems/2025.1.1/bin/nsys profile \
-c cudaProfilerApi \
-b none \
--wait=primary \
--cpuctxsw=none \
--sample=none \
--trace='cuda,nvtx' \
--trace-fork-before-exec=true \
/opt/conda310/bin/python -m rtp_llm.start_server
Alternatively, use the bazel --run_under option to run a bazel target with Nsight profiling:
bazelisk test //rtp_llm/cpp/normal_engine/test:engine_test --config=cuda12_6 \
--run_under="/usr/local/cuda/bin/nsys profile \
--sampling-period 125000 \
--trace='cuda,nvtx,osrt,cublas,cudnn' \
--trace-fork-before-exec=true \
-o /tmp/report.rep"